From the Sponsor


Development  of  Safety-Critical  Software  Systems 


Since the development of the digital computer, software has played an increasingly important and evolving role in the operation and control of hazardous, safety-critical functions. The reluctance of the engineering community to relinquish human control of hazardous operations has diminished dramatically in the last 15 years. Today, digital computer systems have autonomous control over safety-critical functions in nearly every major technology, both commercially and within government systems. This revolution is due primarily to the ability of software to reliably perform critical control tasks at speeds unmatched by its human counterpart. Other factors influencing this transition include our ever-growing need and desire for increased versatility, greater performance capability, higher efficiency, and a decreased life-cycle cost.

In most instances, software can meet all of the above attributes of the system’s performance when properly designed. The logic of the software allows for decisions to be implemented without emotion and with speed and accuracy. This has forced the human operator out of the control loop because they can no longer keep pace with the speed, cost effectiveness, and decision-making process of the system.

According to MIL-STD-882D, the main objective (or definition) of system safety engineering, which includes safety-critical software systems, is “the application of engineering and management principles, criteria, and techniques to optimize all aspects of safety within the constraints of operations effectiveness, time, and cost throughout all phases of the system life cycle.”

The ultimate responsibility for the development of a “safe system” rests with program management. The commitment to provide qualified people and an adequate budget and schedule for a software development program begins with the program director or manager. Top management must be a strong voice of safety advocacy and must communicate this personal commitment to each level of program and technical management. Project directors or managers must support the integrated safety process between systems engineering, software engineering, and safety engineering in the design, development, testing, and operation of the system software.

This issue of CROSSTALK provides an in-depth look at the implementation and development of safety-critical software systems. It also explores how these systems will likely face unplanned challenges during long-term development, requiring developers to build flexibility into their approaches.

Authors Dr. Victor Basili, Kathleen Dangle, Linda Esker, Frank Marotta, and Ioana Rus guide readers through their methodology for developing early safety measures on safety-critical software system projects in Measures and Risk Indicators for Early Insight Into Software Safety.

In Safety and Security: Certification Issues and Technologies, Dr. Benjamin M. Brosgol analyzes the two primary safety and security standards, DO-178B and the Common Criteria, and gives software professionals the tools to avoid hazards and vulnerabilities.

First responders who need a secure and mobile coordination and communication infrastructure during a crisis will take special interest in Sugih Jamin’s WebBee: A Platform for Secure Mobile Coordination and Communication in Crisis Scenarios.

In Constructing Change-Tolerant Systems Using Capability-Based Design, Dr. James D. Arthur and Ramya Ravichandar recognize the need for flexibility and provide readers with thought-provoking ideas on how a capability-based approach may be the answer to complex, large-scale systems that are hostile to change.

And, finally, don’t miss DoD Business Mission Area Service-Oriented Architecture to Support Business Transformation by Dennis E. Wisnosky, Dimitry Feldshteyn, Wil Mancuso, Al (Edward) Gough, Eric J. Riutort, and Paul Strassman. They examine whether a service-oriented architecture (SOA) is the best fit for the Department of Defense’s (DoD’s) Business Mission Area (accounting for roughly half of the DoD Information Technology budget) and describe the DoD’s SOA vision.

I hope you enjoy reading CROSSTALK’s variety of articles on how to better approach the development of safety-critical software systems. I certainly did.


Ken  Chirkis 

Naval Air Systems Command


Software contributes an ever-increasing level of functionality and control in today's systems. This increased use of software can dramatically increase the complexity and time needed to evaluate the safety of a system. Although the actual system safety cannot be verified during its development, measures can reveal early insights into potential safety problems and risks. An approach for developing early software safety measures is presented in this article. The approach and the example software measures presented are based on experience working with the safety engineering group on a large Department of Defense program.


The purpose of the system safety process is to identify and mitigate hazards associated with the operation and maintenance of a system under development. System safety is often implemented through an approach that identifies hazards, defines actions that will mitigate them, and verifies that the mitigations have been implemented. The residual risk is the risk remaining when a hazard cannot be completely mitigated. The goal of the system safety process is to reduce this residual risk to an acceptable level, as defined by the safety certifier. Cost is a consideration in determining the level of acceptable residual risk.

As software contributes an ever-increasing level of functionality and control in today’s systems, the system safety process must scrutinize software-specific components of the system. Software can contribute to system safety as both a hazard source and a hazard mitigation. Software is not intrinsically hazardous, but it plays a role in safety in many systems when it:

• Causes hardware to perform unsafe actions.

• Directs an operator to perform unsafe actions.

• Guides an operator to make unsafe decisions.

• Mitigates hazards.

In this article, we define a measurement approach that provides early visibility into the implementation of the software safety hazard process, assessing the level of consistency and discipline that is applied to the process for identifying and mitigating software-related hazards. Early process visibility assists safety engineers in detecting breakdowns in the process, asking the right kinds of questions, and making timely decisions that will improve the resulting system safety. This early visibility is important because mitigations typically affect system requirements and design; making these decisions late in the system development lifecycle can be cost-prohibitive. The proposed measurement approach performs process checks to identify risks resulting from the application of the safety hazard analysis process (or lack thereof) and assesses the potential for achieving a safe system. It is important to note that this approach does not provide for an evaluation of the system’s safety.

This  article  begins  by  defining  terms 
and  documenting  our  assumptions.  We 
then  describe  our  approach  for  defining 
specific  safety  measures  in  the  context  of 
an  existing  environment  and  provide  some 
examples. 

Terminology  and  Key  Concepts 

A hazard is any real or potential condition that can cause injury, illness, or death to personnel; damage to or loss of a system, equipment, or property; or damage to the environment. Key terms associated with hazards and their management are:

• Causes. What can make the hazard occur?

• Controls. Mitigation actions whose purpose is to minimize the chances of a hazard occurring.

• Verifications. Some assurance, like safety test cases, that the hazard has been controlled.

A  hazard  is  open  if  at  least  one  of  its 
causes  is  open;  a  cause  is  open  if  at  least 
one  of  its  controls  is  open;  a  control  is 
open  if  at  least  one  of  its  verifications  is 
open.  A  hazard  is  closed  when  all  of  the 
controls  for  all  of  its  causes  have  been 
implemented  and  verified. 
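The open/closed rules above can be sketched as a small data model. This is a minimal illustration in Python, not the schema of any particular HTS: the class and field names (`Verification`, `Control`, `Cause`, `Hazard`, `passed`, `is_open`) are assumptions introduced here. One further assumption goes beyond the text: an element with nothing beneath it (a hazard with no causes, a cause with no controls, a control with no verifications) is treated as open, since nothing has yet been implemented and verified for it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Verification:
    description: str
    passed: bool = False  # True once the verification (e.g., a safety test case) succeeds

    def is_open(self) -> bool:
        return not self.passed


@dataclass
class Control:
    description: str
    verifications: List[Verification] = field(default_factory=list)

    def is_open(self) -> bool:
        # Open if at least one verification is open.
        # Assumption: a control with no verifications is also open.
        return not self.verifications or any(v.is_open() for v in self.verifications)


@dataclass
class Cause:
    description: str
    controls: List[Control] = field(default_factory=list)

    def is_open(self) -> bool:
        # Open if at least one control is open.
        # Assumption: a cause with no controls is also open.
        return not self.controls or any(c.is_open() for c in self.controls)


@dataclass
class Hazard:
    description: str
    causes: List[Cause] = field(default_factory=list)

    def is_open(self) -> bool:
        # Open if at least one cause is open.
        # Assumption: a hazard with no recorded causes is also open.
        return not self.causes or any(c.is_open() for c in self.causes)


# Hypothetical records, loosely modeled on the guidance-system example in this article.
check = Verification("Safety test case for the trajectory computation")
control = Control("Multiple computations of the projected trajectory are polled", [check])
cause = Cause("Miscalculation of the projected trajectory", [control])
hazard = Hazard("Guidance system may malfunction", [cause])

print(hazard.is_open())  # True: the verification has not passed yet
check.passed = True
print(hazard.is_open())  # False: every control of every cause is verified
```

Because status is computed rather than stored, adding a new cause to a closed hazard reopens it automatically, which matches the assumption that closed hazards can become open when a new cause is found.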

A safety-related requirement is a requirement whose purpose is to control a hazard. One hazard might be addressed by several requirements (e.g., one hazard may affect several parts of the system), or one requirement might address several hazards (e.g., a central control or communication system may mitigate hazards from multiple nodes).

A hazard tracking system (HTS) is a repository of identified system hazards and their associated causes, controls, and verifications. Within the HTS, causes should be linked to the system element causing the hazard, controls should be linked to the requirement(s) controlling or mitigating the hazard, and verifications should be linked to the hazard cause and the test verifying that the hazard is controlled.

A hazard is defined as a software-related hazard if it has at least one software cause or one software control. A software safety-related requirement is a software requirement that can create or contribute to a hazard in the system or is defined to control or mitigate a hazard.

Figure 1: Example of a Hazard, Cause, Control, and Verification

An  example  of  a  system  hazard 
description  that  has  a  software-related 
cause  is  as  follows: 

• Accident/Mishap. Undesired and unplanned event that results in a specified level of loss (e.g., two planes collide).

• Hazard/Description. State that leads to an accident (e.g., guidance system may malfunction).

• Hazard Cause. The action causing the hazard to occur (e.g., a miscalculation of the projected trajectory).

• Hazard Control or Safety Requirement. Mitigation via a requirement or set of requirements whose purpose is to minimize the chances of a hazard (e.g., multiple computations of the projected trajectory are performed and polled).

• Verification. An assurance that the hazard has been controlled (e.g., safety test cases).

Figure  1  provides  an  illustration  of  the 
context  for  this  example. 

Several assumptions are made: (1) all hazards should be recorded in an HTS; (2) hazards are retired or have their associated risk reduced over time, but do not leave the HTS; and (3) closed hazards can become open hazards when a new cause is found. Although the approach does not prescribe a particular management or organizational structure, it is assumed that the safety and project organizations communicate and collaborate effectively in both evolving requirements and verifying mitigations. As the safety hazard analysis will impact requirements, design, code, and tests, it is assumed that the standard processes defined by the project for change management apply to artifacts impacted by safety hazard analysis.

The level of rigor (LoR) is the amount of requirements analysis, development discipline, testing, and configuration control required to mitigate the potential safety risks of the software component [1]. Each software component should be assessed and assigned an LoR for development. The LoR covers any mechanism put in place to give specific requirements special treatment, giving a piece of software higher levels of safety assurance and providing users higher confidence through greater discipline and process.

A safety-related defect is a defect that refers to a failure to comply with a safety requirement, an unexpected behavior that affects safety, or the recognition that a control has not been defined, implemented, or verified. Safety-related defects should be traceable to one or more hazards or may generate new hazards. Defects can be counted directly or they can be weighted by the set of related requirements or hazards they affect. A software defect tracking system (i.e., a tool or database that captures software defects identified during testing) is used as the source of safety-related software defects.

Gaining Software Safety Visibility

Our goal in applying the proposed measurement approach is to provide software safety engineers visibility into the software safety process and to assist them in making judgments about the software safety process implementation and its execution.

We identified five needs and defined an associated inquiry area for each:

1. Software Safety Analysis Process. Confirm that system and software requirements and development practices are in compliance with safety processes.

2. Hazard and Mitigation Identification. Ensure that the program is adequately identifying and documenting the appropriate information about a hazard (i.e., hazards, causes, and controls as defined by the software safety analysis process).

3. Hazard Monitoring. Ensure that sufficient actions are taken by analyzing and monitoring hazard causes, controls, and verifications over time (i.e., are the hazard controls being implemented and verified?).

4. Appropriate LoR for Software Safety. Balance risk with the cost of safety by identifying the appropriate software development LoR.

5. Safety-Related Defects. Identify whether any safety problems remain in the system for the safety assessment reports by identifying all outstanding safety-related defects.

For each area, readiness and visibility measures are defined, specifying different measurement details. A readiness assessment provides a preliminary view into the state of the safety process for software and checks that the data needed for the second type of measurement is available. Software safety visibility digs deeper by defining models, measures, and interpretations that provide information on the implementation of safety practices (or lack thereof) and point to safety-related risks and issues.

To minimize the overhead associated with data collection and analysis, a combination of a top-down goal/question/metric analysis and a bottom-up inventory of the data already collected by the organization is used to identify the measures that will be cost-effective and address management needs [2].

For example, to address software safety analysis, an investigation may be performed to determine whether there is a documented safety process that identifies requirements as safety-related and records that information in the requirements repository. If this is not true, then the program may have a problem, and further measures that rely on counting the number of safety-related requirements cannot be utilized. A sample set of key questions addressing the five inquiry areas for the readiness assessment is shown in Table 1. All readiness questions must be answered Yes to indicate that the appropriate measurements can be gathered. No answers provide an early warning that software safety may not be properly addressed. In this case, the recommended action is to identify why the data is not available (root cause) and take an appropriate corrective action. The questions in Table 1 address problems in dealing with safety in general and software safety in particular.

October 2008

www.stsc.hill.af.mil

Development of Fault-Tolerant Systems

Table 1: Readiness Assessment Questions

Inquiry Area: Software Safety Analysis Process
o Is there a documented software safety process that identifies requirements as safety-related?
o Are safety-related software requirements marked as such in the requirements repository?

Inquiry Area: Hazard and Mitigation Identification
o Is there an (automated) HTS where software-related hazards, causes, controls, and verifications are recorded (and can be counted)?

Inquiry Area: Hazard Monitoring
o Are hazards mapped back to their source (requirements) and controls mapped to requirements?
o Are all the fields being entered into the HTS?

Inquiry Area: Appropriate LoR for Software Safety
o Are the various levels of rigor identified and is the distribution rational?

Inquiry Area: Safety Defects
o Are software safety-related failures/faults identified as such in the software defect tracking system?
o Are safety-related test cases identified as such?
o Are defect closures recorded?
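The all-Yes rule for the readiness questions can be sketched as a simple check. This is only an illustration under stated assumptions: the function name `unmet_readiness_questions` and the sample answers are hypothetical, and a real program would take its answers from its own assessment records.

```python
from typing import Dict, List


def unmet_readiness_questions(answers: Dict[str, bool]) -> List[str]:
    """Return the questions answered No; an empty list means measurement can proceed."""
    return [question for question, is_yes in answers.items() if not is_yes]


# Hypothetical answers to three of the Table 1 questions.
answers = {
    "Is there a documented software safety process that identifies "
    "requirements as safety-related?": True,
    "Are safety-related software requirements marked as such in the "
    "requirements repository?": False,
    "Are defect closures recorded?": True,
}

unmet = unmet_readiness_questions(answers)
if unmet:
    # Each No is an early warning: find the root cause before measuring.
    for question in unmet:
        print("Investigate root cause:", question)
else:
    print("All readiness questions answered Yes; measurements can be gathered.")
```

Returning the list of unmet questions, rather than a single pass/fail flag, keeps the focus on the recommended response: identifying why each piece of data is unavailable.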

While these data readiness questions seem simplistic, they can uncover a host of issues that may not be obvious unless the questions are asked explicitly. These questions expose some common problems in implementing a usable, cost-effective HTS and in the overall hazard tracking approach:

• Software Hazard Identification. Safety-related requirements are not identified as such, and hazard controls that are software-related safety requirements are not identified as such either. This can demonstrate inadequate attention to software safety.

• Hazard Traceability. The HTS does not provide sufficient linkages to the requirements documentation system, the test plan, or the defect tracking system. Hazards must be bi-directionally traceable to requirements, tests, and defects in order to verify complete coverage, determine comprehensiveness of the hazard analysis, and ensure that the hazard data represents the system accurately over time.

• Data Integrity. Hazards, causes, and controls may not be described in sufficient detail to be understood and verified. The information in the HTS must be accurate, clear, and specific in order to understand and track hazards throughout the development and deployment of the system.

• LoR. There may be difficulty in differentiating among different levels of rigor for the various software safety requirements and in identifying, assigning, and tracking the appropriate LoR for the specific software components that implement the safety-related requirement. Lack of proper LoR differentiation can lead to inadequate attention to high-risk hazards or too much attention to low-risk hazards. Additionally, the trade-off between higher levels of rigor and their associated higher costs must be considered in order to assess the right balance of LoR distribution. An LoR should be assigned and traceable from requirements through design to code.
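The bi-directional traceability called for under Hazard Traceability can be spot-checked mechanically once links are exported from the HTS and the requirements repository. The sketch below is an assumption-laden illustration: the function name `missing_back_links`, the dictionary layout, and the identifiers (`HAZ-1`, `REQ-10`, and so on) are hypothetical and do not come from any particular tool.

```python
from typing import Dict, List, Set, Tuple


def missing_back_links(
    forward: Dict[str, Set[str]], backward: Dict[str, Set[str]]
) -> List[Tuple[str, str]]:
    """Return (source, target) pairs that are linked forward but not linked back."""
    return sorted(
        (source, target)
        for source, targets in forward.items()
        for target in targets
        if source not in backward.get(target, set())
    )


# Hypothetical exports: hazard -> requirements (from the HTS) and
# requirement -> hazards (from the requirements repository).
hazard_to_reqs = {"HAZ-1": {"REQ-10", "REQ-11"}, "HAZ-2": {"REQ-11"}}
req_to_hazards = {"REQ-10": {"HAZ-1"}, "REQ-11": {"HAZ-1"}}

# Links recorded in the HTS that the requirements repository does not mirror.
print(missing_back_links(hazard_to_reqs, req_to_hazards))  # [('HAZ-2', 'REQ-11')]

# Links recorded in the requirements repository that the HTS does not mirror.
print(missing_back_links(req_to_hazards, hazard_to_reqs))  # []
```

Running the check in both directions matters: a link present in only one system is the kind of synchronization gap that forces data scrubbing later.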

Many HTS problems are caused by an inadequate vision for the use of the HTS, such as when it is viewed as a storage repository rather than an analysis tool. It is important to make sure that (1) the HTS has adequate functionality, quality checks, and documentation; (2) there is traceability and synchronization among the various support systems (e.g., the HTS, the requirements management system, and the defect tracking system); and (3) the quality of the data is monitored to minimize the need to scrub the data later on. The cost of not adhering to this advice is high rework costs and lower-than-desired system safety. Addressing these issues should simply be a part of the software safety development process.

Laying the Measurement Foundation

Once it is clear that the safety process has been established, deeper investigation of each inquiry area can be performed. An example set of software safety visibility goals and questions is presented in Table 2. When a readiness assessment question has been satisfied, the software safety visibility questions and measures can be applied throughout the life cycle of the program.

Table 2: Software Safety Visibility Needs

Inquiry Area: Software Safety Analysis Process
Goal: Check how well each organization, system, and integrator is addressing software safety in the system hazard analysis process.
Software Safety Visibility Questions:
o Have a reasonable number of software safety-related requirements been identified?

Inquiry Area: Hazard and Mitigation Identification
Goal: Check if a reasonable number of software-related hazards, causes, controls, and verifications are identified.
Software Safety Visibility Questions:
o Have a reasonable number of software safety hazards been identified?
o Are causes, controls, and verifications being generated over time?
o Does every cause have at least one control?
o Does every control have at least one verification?

Inquiry Area: Hazard Monitoring
Goal: Check if software-related hazards (and hazard software components, i.e., causes, controls, and verifications) are identified and closed at an appropriate rate.
Software Safety Visibility Questions:
o Have the number of open software causes/controls for hazards decreased over time?

Inquiry Area: Appropriate LoR for Software Safety
Goal: Check if the various software development groups are assigning reasonable levels of rigor to safety-related software.
Software Safety Visibility Questions:
o Have the appropriate levels of rigor been allocated to software development?

Inquiry Area: Safety Defects
Goal: Check if software safety-related defects are being handled.
Software Safety Visibility Questions:
o Have safety-related software defects been closed at a reasonable rate over time?

Table 3: Some Examples of Software Safety Measures

Inquiry Area: Software Safety Analysis Process
Measure(s): Percent Software Safety Requirements (PSSR) and Estimated PSSR (EPSSR), where
PSSR = # software safety requirements / # software requirements * 100
Model(s): if |PSSR - EPSSR| < e then a reasonable number of software safety requirements have been identified, where
EPSSR = the average of the PSSRs for all systems in the family (in line with other systems) and e = σ(EPSSR) (i.e., standard deviation of the PSSRs used to calculate EPSSR)
or
EPSSR = # system safety requirements / # system requirements * 100 (in line with system safety in general) and e = 20% of EPSSR.
Response(s): PSSR not being within the range of EPSSR should indicate the need for a management action. For example, check into the safety hazard elicitation process and whether it is being applied correctly, investigate the reason why the system under consideration has such a small (or large) percentage of safety requirements, and develop a “get well” plan. If the value is too large, what are the cost and schedule implications of corrective actions?

Inquiry Area: Hazard Monitoring
Measure(s): Hazard cause/control closure evolution (HCCE):
$HCCE_{i,3} = MA_{i+1,3} / MA_{i,3}$
where
$MA_{i,3} = (X_{i-2} + X_{i-1} + X_i) / 3$
is the moving average of the set of open causes (controls) at three consecutive time intervals.
Model(s): If $HCCE_{i,3} > 1$ then the closure rate of hazard software causes/controls is not converging.
Response(s): If the number is > 1 and it is not in the beginning phases of development, more effort should go into closing the hazard software causes/controls. If it is because the opens are increasing too fast (new hazards are being introduced, new causes for existing hazards), then investigate the reasons. If it is because the closes are not increasing fast enough, then investigate the reasons. Graphing the cumulative identified, open, and closed causes/controls provides good insight into the trends of these variables.

Inquiry Area: Safety Defects
Measure(s): Count by priority of open safety-related software trouble reports at time i (COSRTR).
Model(s): If COSRTR ≠ 0 then there are open defects that need further analysis.
Response(s): If not all safety-related defects are closed, then create a list of open defects, prioritize them, and investigate why they exist. This measure should be taken periodically starting at the beginning of test and up until safety assessment report delivery.

Establishing the measures requires more than identifying the data to be collected. Each measure is characterized in terms of the question it answers, the model used to interpret its values in order to answer the target question, the response that suggests the action to be taken based upon the answer to the question, and the scope of applying the measure. Table 3 presents examples of models and responses for three of the five inquiry areas¹.
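The hazard monitoring measure in Table 3 can be sketched in a few lines of code. The sketch reads the table's formula as HCCE(i,3) = MA(i+1,3) / MA(i,3), where MA(i,3) is the average of the open-item counts at intervals i-2, i-1, and i; that reading, the function names `moving_average` and `hcce`, and the sample counts are assumptions made for illustration.

```python
from typing import List


def moving_average(counts: List[int], i: int, window: int = 3) -> float:
    """Average of counts[i - window + 1 .. i], with i a zero-based interval index."""
    if i < window - 1 or i >= len(counts):
        raise ValueError("not enough data for the requested window")
    return sum(counts[i - window + 1 : i + 1]) / window


def hcce(counts: List[int], i: int, window: int = 3) -> float:
    """Ratio of the moving average at interval i + 1 to the moving average at i."""
    previous = moving_average(counts, i, window)
    if previous == 0:
        raise ValueError("no open items in the earlier window; ratio is undefined")
    return moving_average(counts, i + 1, window) / previous


# Hypothetical counts of open software causes at six consecutive reporting intervals.
open_causes = [12, 15, 18, 16, 13, 10]

for i in range(2, len(open_causes) - 1):
    ratio = hcce(open_causes, i)
    status = "not converging" if ratio > 1 else "converging"
    print(f"interval {i}: HCCE = {ratio:.3f} ({status})")
```

With these sample counts the first ratio is above 1, reflecting the early growth in open causes, and later ratios fall below 1 as closures outpace newly identified causes. Averaging over three intervals smooths out single-interval spikes, so one burst of new causes does not by itself trigger a response.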

For each model, assumptions were made about how the resulting measurements should be interpreted. An expected value is selected, along with a range within which the actual value is acceptable. The expected value can be derived from: (1) historical data from past programs, (2) prior data from the current program, (3) a proxy estimate (i.e., comparison with something similar), or (4) an expert estimate. The range of the expected values can be based on general distributions or on specific or related experience.

If  the  calculated  value  is  not  within 
the  expected  range,  then  there  may  be  a 
problem.  Expected  values  or  ranges  can 
be  improved  over  time  based  upon  the 
incorporation  of  new  data  into  the 
model. 

To illustrate these concepts, consider one measure proposed for the process area, PSSR, which is defined as PSSR = # software safety requirements / # software requirements * 100. The model can be defined as:

if |PSSR - EPSSR| < e,
where EPSSR is the estimated value of PSSR, e is the acceptable threshold for deviation from the estimate, and (EPSSR - e, EPSSR + e) is the acceptable range,
then a reasonable number of software safety requirements have been identified.

The key is to have good estimates for EPSSR and e. Ideally, historical data should be used, with the estimated value and range (i.e., sigma, the standard deviation) taken from a similar system or subsystem. However, there may be little historical data. In this case, proxies are identified for estimates².

One possible proxy is to use system safety requirements as the benchmark for software safety requirements. We can let the range be defined by some percentage around that value that provides initially acceptable limits. Once the program is under development, early data from the program can be substituted for these proxies. Thus:

EPSSR = # system safety requirements / # system requirements * 100
and e = 20 percent of EPSSR.

The model is interpreted by defining a response if the resulting value is not within range. For example, if PSSR is not within the range e of EPSSR, it indicates the need for management action. One example would be to check into the safety analysis process and whether it is being appropriately applied, investigate the reason why the system under consideration has such a small (or large) percentage of safety requirements, and develop a get-well plan.
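The PSSR model and its two ways of estimating EPSSR and e can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function names and all requirement counts are hypothetical, and the use of the sample standard deviation for the historical option is an assumption, since the article does not say which form of standard deviation is intended.

```python
import statistics
from typing import Sequence, Tuple


def percent(part: int, whole: int) -> float:
    """Return part / whole expressed as a percentage."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return 100.0 * part / whole


def proxy_estimate(system_safety_reqs: int, system_reqs: int) -> Tuple[float, float]:
    """Proxy option: EPSSR from system-level counts, with e = 20 percent of EPSSR."""
    epssr = percent(system_safety_reqs, system_reqs)
    return epssr, 0.20 * epssr


def historical_estimate(family_pssrs: Sequence[float]) -> Tuple[float, float]:
    """Historical option: EPSSR = mean of the family's PSSRs, e = their standard deviation."""
    # Assumption: sample standard deviation (needs at least two data points).
    return statistics.mean(family_pssrs), statistics.stdev(family_pssrs)


def pssr_in_range(pssr: float, epssr: float, e: float) -> bool:
    """The model: True when |PSSR - EPSSR| < e."""
    return abs(pssr - epssr) < e


# Hypothetical counts for one software system and for the overall system.
pssr = percent(45, 600)               # software safety reqs / software reqs
epssr, e = proxy_estimate(120, 1000)  # system safety reqs / system reqs

if pssr_in_range(pssr, epssr, e):
    print("A reasonable number of software safety requirements has been identified.")
else:
    print("PSSR is outside the expected range; management action is indicated.")
```

In this made-up case the software PSSR falls well below the system-level proxy, so the response defined for the model applies: examine how the safety analysis process is being applied before trusting the count.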

In defining these measures, existing data sources (e.g., a hazard tracking database, requirements management repositories, and defect tracking systems) and processes (e.g., safety analysis processes) were leveraged. This can be done provided that the assumptions about data collection (listed in the Terminology and Key Concepts section) hold. The derived measures in Table 3 can be represented graphically (e.g., as evolution over time), as appropriate, to present the analysis results for the questions they help to answer. Key issues for determining software safety visibility are: (1) selecting the right subset of measures, (2) defining appropriate thresholds, (3) determining appropriate management responses, and (4) providing user-friendly reports and actionable responses; all of these issues are program-dependent.

The safety measures collected by a program form the beginning of an experience base, which creates a historical base across current projects within a program and for future programs. To date, there is very little data on which to calibrate the models. It is hoped that programs will start collecting data so that more knowledge can be obtained and software safety measure baselines can be established.

Conclusion 

The methodology presented here should be tailored to fit the context of the organization; it is not intended to imply a correct answer or an all-or-nothing approach. The areas of inquiry and the measures can be adjusted appropriately; however, at a minimum, any program dealing with safety should address the readiness questions.

Gaining visibility into software safety through objective measures
has become increasingly important for today’s software-intensive
programs. Although software safety measures cannot determine whether
a system is safe, they can provide valuable indicators of problems
and risks that give management critical knowledge for making timely
and well-informed decisions.
Many military systems are safety-critical, with failure possibly
resulting in loss of human life. In today's interconnected
environment, safety requires security. Compliance with the higher
levels of safety or security standards demands a disciplined
development process along with appropriate programming language and
toolset technology. Since full, general-purpose languages are too
large and complex to be usable for safety-critical or high-security
systems, a key requirement is to define subsets that are simple
enough to facilitate certification but expressive enough to program
the needed application functionality. This article summarizes
representative safety and security standards (DO-178B and the Common
Criteria, respectively), identifies the language-related issues
surrounding safety and security certification, and assesses three
candidate technologies (C, including C++; Ada, including SPARK; and
Java) with respect to suitability for safety-critical or
high-security systems.


Compliance with formal safety standards is becoming a major
consideration, and sometimes a requirement, in defense systems. The
safety standard that is most relevant is DO-178B [1], which is used
by the Federal Aviation Administration (FAA) for the certification of
commercial aircraft. DO-178B is process-oriented; certification is
not so much a safety assessment of the completed system as an
evaluation of a set of developer-provided artifacts. These artifacts
are intended both to provide direct evidence of sound software
engineering practice and to lend assurance (through indirect
evidence) of the safety of the resulting system.

DO-178B  identifies  five  levels  of  criti¬ 
cality:  A  (highest)  through  E  (lowest). 
Table  1  characterizes  the  various  levels 
and  identifies  the  number  of  DO-178B 
requirements  that  apply.  The  term  with 
independence  means  that  the  evidence  for 
meeting  a  requirement  must  be  supplied 
by  someone  other  than  the  developer.  The 
term  safety-critical  generally  applies  to  soft¬ 
ware  at  levels  A  and  B,  which  demand 
greater  rigor  and  more  comprehensive 
analysis  than  the  lower  levels. 

DO-178B  focuses  on  requirements- 
based  testing  and  bi-directional  traceabili¬ 
ty  (from  requirements  to  code,  and  vice 
versa)  as  key  elements  of  software  verifi¬ 
cation.  Test  cases  must  fully  cover  the 
code;  dead  code  (unexercised  code  that  does 
not  correspond  to  a  specific  requirement) 
must  be  removed. 

DO-178B  is  open  to  criticism  on  sev¬ 
eral  grounds: 

•  It  does  not  directly  assess  the  safety  of 
the  resulting  system. 

•  Although  the  artifacts  are  process-ori¬ 
ented,  there  is  no  guarantee  that  sound 
processes  were  followed  and  it  is  not 
rare  for  developers  to  prepare  the  DO- 
178B  artifacts  for  previously  devel¬ 
oped  components  a  posteriori. 

•  Its emphasis on testing does not adequately take into account
alternative technologies (such as formal methods) for providing
safety assurance.

•  It  is  unclear  on  how  its  objectives 
apply  to  modern  software  develop¬ 
ment  approaches  such  as  object-ori¬ 
ented  technology  (OOT). 

These  problems  should  not  be  overem¬ 
phasized.  DO-178B  has  been  successful  in 
practice:  Although  there  have  been  some 
close  calls,  DO-178B-certified  software  has 
never  been  the  direct  cause  of  an  aircraft 
accident  resulting  in  a  fatality.  However, 
much  has  changed  in  the  software  industry 
since  the  early  1990s  when  DO-178B  was 
written.  Work  is  in  progress  on  a  successor, 
DO-178C  [2],  that  will  attempt  to  address 
some  of  the  perceived  issues  with  DO- 
178B.  For  example,  DO-178C  will  accom¬ 
modate  newer  software  technologies 
(OOT,  model-based  design)  and  alternative 
software  verification  techniques  (formal 
methods,  abstract  interpretation). 

Security  Certification 

Security is generally defined as the protection of assets against
threats to their confidentiality, integrity, and/or availability.
Designing an information technology (IT) product for security thus
involves design steps (avoiding vulnerabilities that adversaries
could exploit to compromise these requirements) as well as runtime
actions (detecting and responding to attempted breaches).

DO-178B  says  nothing  explicit  about 
security,  but  a  system  with  security  vulner¬ 
abilities  is  at  risk  for  exploitation  by  adver¬ 
saries  to  render  it  unsafe.  The  safety  requires 
security  principle  has  two  implications.  First, 
an  organization  developing  safety-critical 
software  should  adopt  methodologies  and 
design  frameworks  that  can  help  realize 
security  requirements.  Guidelines  such  as 
the  “Defense  Acquisition  Guidebook”  [3] 
and  architectures  such  as  multiple  inde¬ 
pendent  levels  of  security  (MILS)  [4]  are 
relevant.  Second,  it  must  be  possible  to 
assess whether a product with safety-critical requirements is
sufficiently secure. The Common Criteria/Common Evaluation
Methodology [5] provides such an assessment mechanism. These international
standards  include  a  catalog  of  security- 
functional  and  assurance  requirements  and 
a  process  for  evaluating  the  security  char¬ 
acteristics  of  a  given  IT  product. 

Somewhat analogous to the levels of DO-178B, the Common Criteria
defines seven Evaluation Assurance Levels (EALs), numbered from 1 to
7 in increasing order of criticality. Generally speaking, EAL 4
corresponds to best commercial practice without a serious focus on
security threats; higher levels require special security-oriented
mechanisms and increased effort in demonstrating compliance. At
EAL 7, formal methods (i.e., mathematics-based analyses) are required
to demonstrate that security requirements are met.

Table 1: Criticality Levels in DO-178B

Level  Condition                 Effect of Anomalous Behavior                        Number of Objectives
A      Catastrophic failure      “... prevent continued safe flight and landing...”  66 (14 with independence)
B      Hazardous/severe failure  “... serious or potentially fatal injuries to a     65 (11 with independence)
                                 small number of ... occupants ...”
C      Major failure             “... discomfort to occupants, possibly including    57
                                 injury...”
D      Minor failure             “... some inconvenience to occupants...”            28
E      None                      “... no effect on aircraft operational capability
                                 or pilot workload...”

With  safety  implying  the  need  for 
security,  it  is  reasonable  to  consider  certi¬ 
fying  a  system  against  standards  for  both. 
This  idea  is  not  new;  the  SafSec  project 
[6],  sponsored  by  the  United  Kingdom’s 
Ministry of Defence, has developed an
integrated  methodology  for  dual  safety 
and  security  certification  for  avionics.  In 
another  effort,  a  group  at  the  University  of 
Idaho  has  analyzed  the  correspondence 
between  DO-178B  and  the  Common 
Criteria,  mapping  DO-178B  objectives  to 
Common  Criteria  elements  [7],  and  has 
studied  the  feasibility  of  joint  certification. 
However,  the  practicality  of  applying  the 
Common  Criteria  to  large  Department  of 
Defense  (DoD)  systems  is  unclear.  As 
summarized  in  a  2007  Report  of  the 
Defense  Science  Board  Task  Force: 

Criticisms  of  Common  Criteria- 
based  schemes  are  that  they  are 
expensive,  require  artifacts  that  are 
not  produced  until  well  after  product 
design  and  implementation,  do  not 
substantially  reduce  implementation- 
level  vulnerabilities  when  using 
today’s  software  development  prac¬ 
tices,  and  lack  thorough  penetration 
analysis  at  EAL  4  and  below.  [8] 

Furthermore,  the  fact  that  a  product 
has  been  certified  at  a  specific  EAL  means 
very  little  by  itself.  First,  it  says  nothing 
about  the  quality  of  the  product  outside 
the  security  functional  requirements. 
Second,  a  prospective  consumer  needs  to 
understand  which  of  these  requirements 
are  being  implemented  and  whether  the 
vendor-assumed  operational  environment 
(the severity of the threats and the assumed skills
of  the  adversaries)  matches  reality. 

Notwithstanding  how  it  is  assessed, 
security  is  obviously  necessary  for  safety, 
and  safety  certification  agencies  are  paying 
increasing  attention  to  the  relationship 
between  the  two.  As  an  example,  an  FAA 
“Special  Conditions”  notice  directed  one 
supplier  to  demonstrate  the  independence 
of  the  networks  for  passenger-accessible 
components  and  flight  control  on  one  of 
its  aircraft  [9]. 

A  Comparison  of  Safety  and 
Security  Certification  Issues 

DO-178B  and  the  Common  Criteria  have 
some  basic  similarities: 

•  Concern with the full software development life cycle, including
peripheral activities such as configuration management, in an attempt
to catch human (developer) error before the system is fielded.

•  Tiered approach (criticality levels) reflecting real-life
trade-offs: Resources are finite, and a system must be safe/secure
enough for its intended purpose.

•  Emphasis  on  testing  as  a  major  ele¬ 
ment  of  software  verification. 

There  are  also  some  important  differ¬ 
ences: 

•  Scope  of  requirements.  DO-178B 
deals  with  the  entire  system;  the 
Common  Criteria  focuses  almost 
exclusively  on  just  the  security  func¬ 
tional  requirements. 

•  Functional  requirements.  There  is 
no  specific  set  of  safety  functions 
called  out  in  DO-178B.  In  contrast, 
the  security  domain  has  well-defined 
functional  requirements  that  need  to 
be  implemented. 

•  System users/operators. An IT product must be immune to attacks
from unknown and possibly malevolent users who can directly supply
input. Input to a safety-critical system is generally supplied by
known operators whose trustworthiness has been separately vetted.

In  one  sense,  compliance  with  safety 
standards  is  more  demanding: 

•  Each  component  must  be  certified 
against  requirements  for  its  safety  level. 

•  At  the  higher  levels,  it  is  necessary  to 
both  demonstrate  the  absence  of  dead 
code  and  perform  structural  testing  to 
verify  the  absence  of  non-required 
functionality. 

•  For EAL compliance, the specific development and testing
requirements apply only to the security functions and not to the
entire IT product; there is no prohibition against dead code/extra
functionality (although such code must be shown to be free from
vulnerabilities).

In another sense, compliance with security standards is more
demanding:

•  Formal methods are required at EAL 7.

•  Vulnerability analysis is difficult and must assume a
sophisticated and malevolent adversary; for safety, the adversaries
are the laws of physics.

Although safety requires security, the relationship in the other
direction is not so immediate. Most IT products for which security is
critical do not control systems where life is at stake and, thus,
safety is generally not an issue.

In the context of overall system design, safety and security
sometimes conflict, especially with respect to behavior under failure
conditions. Taking a system offline to protect data may be reasonable
behavior for security, but if the data are needed for flight
control/management, then such a policy may have disastrous
consequences for safety. Fail-safe is not the same thing as
fail-secure. These sorts of conflicts need to be resolved during
design, with appropriate trade-offs based on the anticipated risks.

Programming  Language 
Requirements 

The  programming  language  choice  is 
arguably  the  most  important  technical 
decision  that  the  developer  organization 
will  make.  As  summarized  in  a  National 
Academy  of  Sciences  report: 

The  overwhelming  majority  of 
security  vulnerabilities  reported  in 
software  products — and  exploited 
to  attack  the  users  of  such  products 
— are  at  the  implementation  level. 
The  prevalence  of  code-related 
problems,  however,  is  a  direct  con¬ 
sequence  of  higher-level  decisions 
to  use  programming  languages, 
design  methods,  and  libraries  that 
admit  these  problems.  [10] 

Although  directed  at  security  issues, 
these  comments  apply  equally  to  safety. 
The  programming  language  plays  a  key 
role  in  determining  the  ease  or  difficulty 
of  developing  software  that  avoids  vulner¬ 
abilities  and  that  is  certifiable  against  safe¬ 
ty  or  security  standards. 

A simple example of a programming language feature that can easily
lead to an application vulnerability is the C library function
gets(), which reads a character string as input from a user until an
end-of-line or end-of-file is encountered. The program can only
pre-allocate an area of some fixed length as the destination, but a
user can accidentally or intentionally supply input that exceeds this
bound. The effect is the classical buffer overflow, in which the
excess characters overwrite other data, possibly including a
function’s return address. By crafting an input string with specific
content, a malevolent user can take control of the machine to execute
arbitrary code.
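
As an illustration of how the overflow described above can be avoided, the following minimal C sketch reads a line with the standard fgets() function, which takes the destination size as a parameter. The helper name read_line_bounded and its return codes are assumptions made for this example; they do not come from any standard or from the article's sources.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: reads one line from `in` into `buf`, whose capacity
 * is `cap` bytes including the terminating '\0'. Unlike gets(), it never
 * writes past the end of `buf`.
 * Returns  0 if a complete line was stored (newline removed),
 *         -2 if the line was too long (stored truncated, excess discarded),
 *         -1 on end-of-file, read error, or invalid arguments. */
int read_line_bounded(FILE *in, char *buf, size_t cap)
{
    if (in == NULL || buf == NULL || cap < 2 || cap > (size_t)INT_MAX) {
        return -1;
    }
    if (fgets(buf, (int)cap, in) == NULL) {
        return -1;
    }

    size_t len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[len - 1] = '\0';   /* the whole line fit: strip the newline */
        return 0;
    }

    /* No newline was stored: either the input ended or the line is longer
     * than the buffer. Discard the rest of the line so that the next call
     * starts on a fresh line, and report truncation only if something was
     * actually discarded. */
    int truncated = 0;
    int c;
    while ((c = getc(in)) != '\n' && c != EOF) {
        truncated = 1;
    }
    return truncated ? -2 : 0;
}
```

A caller would declare a fixed-size array, pass sizeof the array as cap, and treat a return value of -2 as a rejected input rather than silently using the truncated text.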

Although  DO-178B  and  the  Common 
Criteria  do  not  offer  direct  guidance  on  a 
programming  language  choice,  it  is  possi¬ 
ble  to  abstract  from  their  specific  objec¬ 
tives  and  infer  several  general  require¬ 
ments  that  a  language  must  meet.  The  fol¬ 
lowing  sections  discuss  four  of  these 
requirements:  reliability,  predictability, 
analyzability,  and  expressiveness. 
Table 2: Analysis Techniques

Approach                    Group Name                  Technique(s)
Static Analysis             Flow Analysis               Control Flow
                                                        Data Flow
                                                        Information Flow
                            Symbolic Analysis           Symbolic Execution
                                                        Formal Code Verification
                            Range Checking              Range Checking
                            Stack Usage                 Stack Usage
                            Timing Analysis             Timing Analysis
                            Other Memory Usage          Other Memory Usage
                            Object Code Analysis        Object Code Analysis
Dynamic Analysis (Testing)  Requirements-Based Testing  Equivalence Class
                                                        Boundary Value
                            Structure-Based Testing     Statement Coverage
                                                        Branch Coverage
                                                        Modified Condition/Decision Coverage


Reliability 

The language should promote the development of readable, correct
code. In particular, it should:

•  Have an intuitive lexical and syntactic structure and be free of
traps and pitfalls.

•  Help detect errors early (at compile time if possible) and prevent
errors such as out-of-range array indices and references to
uninitialized data.

•  Help (if it supports concurrent programming with threads/tasks) in
avoiding errors such as unprotected accesses to shared data, race
conditions (where the effect of the program depends on the relative
speed of the threads/tasks), and deadlock.

Predictability 

The  language  specification  should  be 
unambiguous.  Implementation-dependent 
or,  worse,  undefined  behavior  introduces 
vulnerabilities  because  the  effect  of  the 
program  may  not  be  as  the  developer  had 
intended. 
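
One concrete case of implementation-dependent behavior is the order in which C evaluates the arguments of a function call, which the language leaves unspecified. The sketch below is illustrative only; the names next_id, combine, and reset_ids are invented for the example. It contrasts a call whose result depends on the compiler with a version that fixes the order explicitly.

```c
/* Hypothetical ID generator with a side effect on shared state. */
int id_counter = 0;

void reset_ids(void) { id_counter = 0; }

int next_id(void) { return ++id_counter; }

int combine(int first, int second) { return 10 * first + second; }

/* The order in which the two arguments are evaluated is unspecified in C,
 * so starting from a reset counter this may yield 12 with one compiler and
 * 21 with another. */
int combine_unspecified(void)
{
    return combine(next_id(), next_id());
}

/* Sequencing the calls through named temporaries makes the result the
 * same on every conforming implementation. */
int combine_sequenced(void)
{
    int first = next_id();
    int second = next_id();
    return combine(first, second);
}
```

Coding standards for critical systems typically steer developers toward the second form, since a reviewer or analysis tool can then reason about a single, well-defined behavior.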

Analyzability 

The  language  should  facilitate  both  static 
analysis  (detecting  uninitialized  variables, 
identifying  dead  code,  predicting  maxi¬ 
mum  stack  usage  and  worst-case  execu¬ 
tion  time,  etc.)  and  dynamic  analysis 
(requirements-based  or  structure-based 
testing).  A  useful  catalog  of  such  analysis 
techniques (from [11]) is given in Table 2.
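
To make one of the structure-based techniques concrete, the following sketch works through Modified Condition/Decision Coverage for the decision a && (b || c). It is an illustration under stated assumptions: the decision, the four hand-picked test vectors, and all function names are invented for the example, and the checker tests only the independence-pair criterion (each condition is shown to flip the outcome while the others are held fixed).

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_CONDITIONS 3

/* One test case: a truth value for each of the conditions a, b, c. */
typedef struct {
    bool cond[NUM_CONDITIONS];
} test_vector;

/* The decision under test: a && (b || c). */
bool decision(const test_vector *t)
{
    return t->cond[0] && (t->cond[1] || t->cond[2]);
}

/* Four vectors (number of conditions + 1), chosen by hand. */
const test_vector mcdc_suite[4] = {
    { { true,  true,  false } },  /* outcome true                          */
    { { false, true,  false } },  /* outcome false: pairs with row 0 for a */
    { { true,  false, false } },  /* outcome false: pairs with row 0 for b */
    { { true,  false, true  } },  /* outcome true:  pairs with row 2 for c */
};

/* True if x and y differ in condition k and agree on every other one. */
static bool differ_only_in(const test_vector *x, const test_vector *y,
                           size_t k)
{
    for (size_t i = 0; i < NUM_CONDITIONS; i++) {
        bool differs = (x->cond[i] != y->cond[i]);
        if (differs != (i == k)) {
            return false;
        }
    }
    return true;
}

/* True if some pair of vectors shows condition k independently changing
 * the outcome of the decision. */
bool condition_shown_independent(const test_vector *suite, size_t n,
                                 size_t k)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (differ_only_in(&suite[i], &suite[j], k) &&
                decision(&suite[i]) != decision(&suite[j])) {
                return true;
            }
        }
    }
    return false;
}

/* True if every condition has an independence pair in the suite. */
bool suite_achieves_mcdc(const test_vector *suite, size_t n)
{
    for (size_t k = 0; k < NUM_CONDITIONS; k++) {
        if (!condition_shown_independent(suite, n, k)) {
            return false;
        }
    }
    return true;
}
```

Four vectors suffice here, whereas exhaustive testing of three conditions would need eight; this economy is one reason the technique is practical for decisions with many conditions.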

The  various  analysis  techniques 
impose  constraints  on  the  programming 
language.  For  example,  control  and  data 
flow  analyses  generally  prohibit  the  use  of 
goto  statements,  stack  usage  analysis  gener¬ 
ally  prohibits  recursion,  and  coverage 
analysis  may  preclude  the  use  of  source 
code  constructs  that  generate  implicit 
loops  or  conditionals.  As  a  result,  there  is 
no  such  thing  as  the  safety-critical  or  high- 
security  subset  of  a  given  language.  The 
particular  subset  used  either  determines  or 
is  determined  by  the  analysis  techniques 
that  will  assist  in  demonstrating  compli¬ 
ance  with  the  operative  certification  stan¬ 
dard. 
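
As a small illustration of code written to suit such analyses, the following C sketch implements a search over a sorted array with no recursion, no goto, and a single exit point, so that its stack usage is a fixed, statically determinable amount. The function name and interface are assumptions made for the example, not taken from any particular coding standard.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns the index of `key` in the ascending array sorted[0..n-1], or -1
 * if it is absent. The loop halves the range [lo, hi) on every pass, so it
 * terminates after at most about log2(n) + 1 iterations. */
long find_sorted(const int32_t *sorted, size_t n, int32_t key)
{
    size_t lo = 0;
    size_t hi = n;
    long result = -1;

    while (lo < hi && result < 0) {
        size_t mid = lo + (hi - lo) / 2;   /* written to avoid overflow */
        if (sorted[mid] == key) {
            result = (long)mid;
        } else if (sorted[mid] < key) {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return result;
}
```

A recursive formulation of the same search would be equally correct, but a stack-usage analyzer would then have to bound the recursion depth, which is exactly the kind of reasoning a restricted subset is meant to make unnecessary.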

Automated static analysis tools play an important role in the safety
and security domains; indeed, there is a U.S. Department of Homeland
Security-sponsored project under way, Software Assurance Metrics and
Tool Evaluation (SAMATE) [12], that is identifying and measuring the
effectiveness of such tools. Static analysis tools are most
successful when used during program development, where they can help
detect problems as they are introduced, rather than as a technique
for detecting vulnerabilities a posteriori in existing code.

Expressiveness 

The  language  should  support  general-pur¬ 
pose  programming,  either  through  lan¬ 
guage  features  or  auxiliary  libraries,  and 
should  also  offer  specialized  functionality 
(as  required).  For  real-time  safety-critical 
systems,  this  means  support  for  interrupt 
handling,  low-level  programming,  concur¬ 
rency,  perhaps  fixed-point  arithmetic,  and 
other  features.  For  high-security  systems, 
it  means  mechanisms  are  needed  for 
implementing  security-functional  require¬ 
ments  (e.g.,  for  cryptography). 

Unfortunately,  language  generality  (as 
implied  by  the  expressiveness  require¬ 
ment)  directly  conflicts  with  the  analyz¬ 
ability  requirement.  That  conflict  compli¬ 
cates  the  language  selection  decision. 

Object-Oriented  Technology 

The  history  of  programming  languages 
has  seen  a  steady  evolution  of  features 
that  promote  maintainable  software  and 
many  of  these  features  directly  support 
the  reliability  and  analyzability  require¬ 
ments  previously  described.  However, 
some  of  the  advances  present  difficulties 
for  safety  and  security  certification. 
Perhaps  the  most  significant  example  is 
OOT  [13],  found  in  such  languages  as 
C++,  Ada  95,  and  Java.  OOT  is  not 
addressed  in  DO-178B,  and  there  are 
indeed  a  number  of  challenges: 

•  A  paradigm  clash.  OOT’s  distribu¬ 
tion  of  functionality  across  classes 
conflicts  with  DO-178B’s  focus  on 
tracing  between  requirements  and 
implemented  functions. 

•  Technical issues. The features that are the essence of OOT
complicate safety and security certification. For example, dynamic
binding typically is implemented by a compiler-generated data
structure known as a vtable (a table of addresses of functions). For
safety or security certification, the developer must demonstrate that
the vtable is properly initialized and that it cannot be corrupted.

•  Cultural  issues.  Certification  authori¬ 
ty  personnel  are  not  necessarily  lan¬ 
guage  experts  and  may  (rightfully)  be 
concerned  about  how  to  deal  with 
unfamiliar  technology. 
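
The vtable concern raised under the technical issues above can be pictured with a hand-written analogue in C. The sketch below is not what any compiler generates; it is an explicit table of function addresses with invented names (sensor_table, dispatch_read, and so on) and placeholder return values. It shows the two properties a developer would need to demonstrate: that every entry is initialized, and that the table is declared so that the program itself cannot modify it.

```c
#include <stddef.h>

typedef int (*sensor_read_fn)(void);

enum sensor_kind { SENSOR_ALTITUDE, SENSOR_AIRSPEED, SENSOR_KIND_COUNT };

/* Placeholder implementations with made-up values. */
static int read_altitude(void) { return 10000; }
static int read_airspeed(void) { return 250; }

/* The table of function addresses. Declaring it const means no assignment
 * to it will compile; whether it is also placed in read-only memory
 * depends on the toolchain and target, which is an assumption to verify. */
static const sensor_read_fn sensor_table[SENSOR_KIND_COUNT] = {
    [SENSOR_ALTITUDE] = read_altitude,
    [SENSOR_AIRSPEED] = read_airspeed,
};

/* Returns 1 if every slot in the table holds a function address. */
int table_is_fully_initialized(void)
{
    int ok = 1;
    for (size_t i = 0; i < SENSOR_KIND_COUNT; i++) {
        if (sensor_table[i] == NULL) {
            ok = 0;
        }
    }
    return ok;
}

/* Dispatches through the table after validating the index.
 * Returns 0 and stores the reading in *out, or -1 on a bad argument. */
int dispatch_read(int kind, int *out)
{
    int status = -1;
    if (out != NULL && kind >= 0 && kind < SENSOR_KIND_COUNT &&
        sensor_table[kind] != NULL) {
        *out = sensor_table[kind]();
        status = 0;
    }
    return status;
}
```

With an object-oriented language the equivalent table and its indexing are produced by the compiler, which is why the evidence has to come from the toolchain rather than from inspecting source code like this.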

A  series  of  workshops  several  years  ago 
produced  a  handbook  [14]  that  addressed 
these  issues  in  detail.  The  in-progress  work 
on  DO-178C  is  taking  these  into  account, 
and  it  is  likely  that  the  eventual  new  stan¬ 
dard  will  offer  some  direct  guidance  in  con¬ 
nection  with  OOT.  However,  developers 
are  not  waiting  for  DO-178C.  OOT  is  cur¬ 
rently  being  used  in  safety-critical  code;  as 
one  example,  an  avionics  system  using  Ada 
95’s  object-oriented  features  has  been  certi¬ 
fied  at  Level  A.  It  seems  inevitable — as 
experience  with  OOT  and  certification  is 
gained — that  usage  of  object-oriented 
languages  will  increase. 

Candidate  Programming 
Languages 

Although  (in  principle)  any  programming 
language  could  be  used  for  developing 
safety-critical  or  high-security  software, 
the  requirements  for  reliability,  pre¬ 
dictability,  and  especially  analyzability 
imply  that  suitable  subsets  be  chosen.  The 
key  issues  are  how  a  language  can  be  sub¬ 
setted  to  ease  certification  for  applications 
restricted  to  the  subset,  and  whether  the 
language  has  intrinsic  problems  that  can¬ 
not  be  removed  by  subsetting. 

This section summarizes how several current language technologies,
either currently in use or under consideration for safety-critical
systems, compare with respect to subsettability.

C-Based Technology (see Note 2)
MISRA-C 

The  United  Kingdom-based  Motor 
Industry  Software  Reliability  Association 
(MISRA)  has  produced  a  set  of  language 
restrictions,  called  MISRA-C  [16],  which 
attempts  to  mitigate  C’s  vulnerabilities. 
MISRA-C  codifies  best  practices  for  C  pro¬ 
gramming  and  has  become  somewhat  of  a 
de  facto  standard  as  a  C  subset  for  critical 
systems.  Benefits  stem  from  C’s  relative 
simplicity,  the  large  population  of  C  pro¬ 
grammers,  and  a  wide  assortment  of  tools 
and  service  providers.  MISRA-C  has  been 
used  successfully  in  safety-critical  systems. 

On  the  other  hand,  MISRA-C  has 
some  significant  drawbacks: 

•  C was not designed for safety-critical systems, and some intrinsic
issues (e.g., the wraparound semantics for integer overflow) cannot
be removed by subsetting.

•  Despite MISRA-C’s stated goals, the rules are not always
enforceable by static tools, and different tools may enforce the
subset differently.

•  Since  concurrency  is  not  provided  in  C 
(it  is  only  available  through  external 
libraries),  MISRA-C  offers  no  guid¬ 
ance  on  how  to  use  C  in  a  multi¬ 
threaded  environment. 
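
The integer overflow issue in the first drawback above cannot be subsetted away, but it can be guarded against explicitly. In ISO C, unsigned arithmetic wraps around modulo a power of two, while overflow of a signed int is undefined behavior, so in both cases the check has to happen before the addition. The following sketch shows one common way to write such checks; the function names are invented for the example.

```c
#include <limits.h>
#include <stdbool.h>

/* Adds two signed ints. Returns false, leaving *sum untouched, if the
 * mathematical result would not fit in an int. The comparison is arranged
 * so that the test itself can never overflow. */
bool add_int_checked(int a, int b, int *sum)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b)) {
        return false;
    }
    *sum = a + b;
    return true;
}

/* Adds two unsigned ints. Returns false instead of silently wrapping. */
bool add_uint_checked(unsigned a, unsigned b, unsigned *sum)
{
    if (a > UINT_MAX - b) {
        return false;
    }
    *sum = a + b;
    return true;
}
```

The burden here falls on the programmer to call the checked form everywhere it matters, which is the contrast with languages that perform the check as part of the arithmetic itself.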

C++ 

C++  [17]  is  in  many  ways  a  better  C. 
Developers  of  safety-critical  systems 
often  have  staff  knowledgeable  in  C++ 
and  may  possess  existing  C++  compo¬ 
nents  that  they  would  like  to  re-use  in  a 
certified  system. 

To help meet this goal, coding standards such as Joint Strike
Fighter C++ [18] and MISRA C++ [19] have been
developed. These extend or adapt
MISRA-C  to  deal  with  C++’s  additional 
facilities.  The  rules  constrain  the  usage  of 
language  features  in  order  to  avoid  prob¬ 
lems  and  to  promote  good  style. 

Safe  C++  coding  standards  are  essen¬ 
tial  if  C++  is  chosen,  and  C++  has  been 
used  to  develop  safety-critical  systems. 
However,  the  previously  noted  drawbacks 
for  MISRA-C  apply  here,  and  the  OOT 
coding  guidelines  do  not  solve  the  under¬ 
lying  certification  issues. 

Ada-Based  Technology 
Ada 

Ada [20] was designed to be used for safety-critical systems. It
avoids many of the C and C++ vulnerabilities (e.g., by checking for
out-of-range array indexing and integer overflow), and also offers a
standard set of concurrent programming features. Ada continues to be
widely used for safety-critical systems, including military and
commercial avionics.

Full Ada is too large to be practical for safety certification, so
subsetting is required. Ada provides a unique approach to this issue,
allowing the application to specify the features that are to be
excluded. Excluding a feature means that no runtime support library
is included for it and that attempted uses are detected as
compile-time errors. This a la carte approach to defining language
subsets is flexible and does not require specialized tool support: a
standard compiler performs the necessary analysis.

The  latest  Ada  language  standard  also 
includes  the  Ravenscar  profile  [21],  a  cer¬ 
tifiable  subset  of  concurrency  features. 

Ada’s disadvantages for safety-critical systems are largely external
(non-technical). Ada usage is not as widespread as that of other
languages and, thus, its tool vendor community is smaller. On the
technical side, Ada does not directly address vulnerabilities such as
references to uninitialized variables. As with C and C++,
supplementary analysis is required to detect/prevent such errors.

SPARK 

SPARK [22] is a subset of Ada 95, augmented by specially formed
comments known as contracts (or annotations), designed to facilitate
a rigorous, static demonstration of program correctness. SPARK omits
features that complicate analysis or formal proofs or that interfere
with bounded time/space predictability. The language includes most of
Ada 95’s static features as well as the Ravenscar concurrency
profile, and the semantics are completely unambiguous (no
implementation-dependent or undefined behavior).

Contracts in SPARK specify data and information flow, inter-module
dependencies, and dynamic invariants (pre-/postconditions,
assertions). The SPARK tools analyze the program to ensure that the
code is consistent with the contracts and that no runtime exceptions
will be raised. They detect errors such as potential references to
uninitialized variables and dead code. The static analysis performed
by the SPARK tools is sound (there are no false negatives, an
especially important requirement in connection with safety
certification) with a low false alarm rate (there are few false
positives). The SPARK tools can also generate verification conditions
and automate the proof of these conditions.

SPARK has been used in practice on a variety of systems, both
safety-critical and high-security. Of all the candidate language
technologies, SPARK best meets the requirements for reliability,
predictability, and analyzability. Its main technical drawback is
with expressiveness, as it has a rather restricted feature set.
Additionally, the SPARK infrastructure (user/vendor community) is
smaller than that of other language technologies.

Java-Based  Technology 

Java  [23]  seems  simultaneously  logical  and 
curious  as  a  technology  choice  for  safety- 
critical  systems.  On  one  hand,  it  was 
designed  with  careful  attention  to  security: 
Its  initial  goal  was  to  enable  downloadable 
applets  to  be  executed  on  client  machines 
without  risk  of  compromising  the  confi¬ 
dentiality  or  integrity  of  client  resources. 
The  Java  language  is  largely  free  from  the 
implementation  dependencies  found  in  C, 
C++,  and  Ada,  such  as  order  of  expres¬ 
sion  evaluation.  Java  also  performs  con¬ 
servative  checks  to  prevent  unreachable 
(dead)  code  and  references  to  uninitialized 
variables.  It  provides  automatic  storage 
management  (garbage  collection)  instead  of 
an  explicit  free  construct  that  is  the  source 
of  subtle  errors  in  other  languages. 

Security,  however,  is  not  the  same  as 
safety.  Indeed,  Java  technology  has  limita¬ 
tions  for  safety-critical  systems,  falling  into 
two  general  categories: 

1.  Ensuring real-time predictability.

2.  Meeting  certification  standards  such  as 
DO-178B. 

Both  of  these  have  been  the  subject  of 
Java  Specification  Requests  (JSRs)  under 
Sun  Microsystems’  Java  Community 
Process  [24].  JSRs  1  [25]  and  282  [26]  have 
defined  the  Real-Time  Specification  for 
Java  (RTSJ);  JSR-302  [27],  in  progress,  is 
defining  a  subset  of  the  RTSJ  that  is 
intended  for  Java  applications  that  need  to 
be  certified  to  DO-178B  at  levels  up  to  A. 

The RTSJ extends the Java platform to add real-time predictability.
The main enhancements are for concurrency (to define scheduling
semantics more precisely than in standard Java, and to prevent
priority anomalies) and memory management (to avoid garbage
collection interference).

The RTSJ, as an extension of the standard Java platform, is not
appropriate (and was not intended) for safety-critical
applications. It is too complex and some of its
major  features  (especially  in  the  memory 
management  area)  require  runtime  checks 
that  may  be  expensive.  However,  it  does 
address  Java’s  real-time  issues  and,  thus,  is 
serving  as  the  basis  for  JSR  302’s  safety- 
critical  Java  specification.  This  in-progress 
effort  defines  three  levels  of  support  for 
safety-critical  systems:  (1)  a  traditional 
cyclic  executive  (no  threading);  (2)  a 
thread-based  approach  with  simple  mem¬ 
ory  management;  and  (3)  a  thread-based 
approach  with  more  general  memory 
management.  Each  is  characterized  by  a 
corresponding  subset  of  RTSJ  functional¬ 
ity  and  Java  class  libraries.  JSR-302 
exploits  Java  1.5’s  annotation  feature:  A 
developer  annotates  various  properties  of 
the  code  (for  example,  memory  usage), 
and  static  analysis  tools  verify  the  annota¬ 
tions’  correctness. 
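
The first of the three support levels above, the cyclic executive, is a language-independent scheduling pattern, and a sketch may help readers who have not met it. The C code below is purely illustrative and is not the JSR-302 API: the frame counts, task names, and the wait_for_frame_tick placeholder are all assumptions. A fixed table assigns tasks to frames, and a single loop runs each frame's tasks in order, with no threads and no dynamic scheduling decisions.

```c
#include <stddef.h>

typedef void (*task_fn)(void);

#define FRAMES_PER_MAJOR_CYCLE 4
#define MAX_TASKS_PER_FRAME    2

/* Counters stand in for real work so that the schedule can be observed. */
unsigned fast_runs = 0;
unsigned slow_runs = 0;

static void fast_task(void) { fast_runs++; }   /* runs in every frame    */
static void slow_task(void) { slow_runs++; }   /* runs once per cycle    */

/* The whole schedule is a constant table fixed at build time. */
static const task_fn schedule[FRAMES_PER_MAJOR_CYCLE][MAX_TASKS_PER_FRAME] = {
    { fast_task, slow_task },
    { fast_task, NULL      },
    { fast_task, NULL      },
    { fast_task, NULL      },
};

/* Placeholder: a real system would block here until a hardware timer
 * signals the start of the next frame. */
static void wait_for_frame_tick(void) { }

/* Runs the given number of major cycles, frame by frame. */
void run_major_cycles(unsigned cycles)
{
    for (unsigned c = 0; c < cycles; c++) {
        for (size_t f = 0; f < FRAMES_PER_MAJOR_CYCLE; f++) {
            wait_for_frame_tick();
            for (size_t t = 0; t < MAX_TASKS_PER_FRAME; t++) {
                if (schedule[f][t] != NULL) {
                    schedule[f][t]();
                }
            }
        }
    }
}
```

Because the execution order is fully determined by the table, timing and stack-usage analyses only have to consider one interleaving, which is what makes this level the simplest of the three to certify.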

Of  all  the  language  technologies  that 
are  candidates  for  safety-critical  develop¬ 
ment,  Java  has  the  most  significant  chal¬ 
lenges: 

•  The  Virtual  Machine  execution  envi¬ 
ronment  for  Java  programs  is  uncon¬ 
ventional,  blurring  the  distinction 
between  code  and  data  and  raising 
safety  certification  issues. 

•  Unlike  C++  and  Ada  (where  OOT  is 
available  but  optional),  Java  is  based 
around  object  orientation.  It  is  possible 
to  use  Java  without  taking  advantage  of 
OOT,  but  the  style  is  contrived. 

•  Java  is  lexically  and  syntactically  based 
on  C,  therefore  sharing  a  number  of 
that  language’s  traps  and  pitfalls. 

•  The RTSJ/JSR-302 approach to memory
management is rather complicated,
and requires Java programmers to
carefully analyze dynamic memory usage.

Despite these issues, there is interest in
safety-critical Java from both the developer
and the user communities. An organization
that has adopted Java as an implementation
language on a project may have
some components with safety-critical
requirements, and keeping the entire system
within one language can simplify
some aspects of project management.

Conclusion 

Developing  safety-critical/high-security 
systems  is  difficult.  The  key  skill  is  not  so 
much  the  knowledge  of  a  particular  pro¬ 
gramming  language;  a  software  profession¬ 
al  should  be  able  to  learn  a  new  language 
in  a  short  amount  of  time.  The  more  crit¬ 


ical  talent  is  the  ability  to  think  through  a 
design  and  implementation  with  a  focus 
not  just  on  meeting  a  system’s  functional 
requirements  but  also  on  avoiding  hazards 
and  vulnerabilities.  Such  negative  program¬ 
ming — ensuring  that  bad  things  do  not 
happen — requires  careful  analysis  and  a 
defensive  development  approach  that,  in 
turn,  places  demands  on  the  programming 
language and tools. For software safety and
security, the idea of minding your language is
more than a matter of etiquette; it could be
the key to a system’s success.♦
Notes

1.  This  article  is  based  on  a  tutorial, 
“Safety  and  Security:  An  Analysis  of 
Certification  Issues  and  Technolo¬ 
gies,”  presented  by  the  author  at  the 
Systems  and  Software  Technology 
Conference,  2008. 

2.  In  this  section,  C  means  the  1990  ver¬ 
sion  of  the  ISO  language  standard  [15]. 
Letter to the Editor


Dear  CROSSTALK  Editor, 

Reading August’s 20th anniversary edition,
especially Gary Petersen’s CROSSTALK:
The Long and Winding Road, triggered
my own early memories of the journal
and the Software Technical Support
Center (STSC). I have always appreciated
CROSSTALK,  from  the  authors’  real-world 
application  of  concepts  to  the  “above-and- 
beyond”  assistance  of  your  staff.  We  at 
Northrop  Grumman  put  what  we  learned 
into  practice. 

In  1989,  our  group  at  Northrop  (before 
adding  the  Grumman)  utilized  your  articles 
on  the  Department  of  Defense’s  (DoD) 
requirements  for  the  Capability  Maturity 
Model  Integration  (CMMI®)  Level  3.  While 
I  cannot  remember  the  authors  or  titles, 
these  articles  aided  our  logistics  engineers 
in  developing  a  quality  approach  as  we 
started our transformation into what is
now the much-renowned Software Engineering
Processing Group.

In  the  mid-90s,  Watts  S.  Humphrey — 
who  was,  not  surprisingly,  part  of  the  20th 
anniversary  issue — was  one  of  several 
CROSSTALK  authors  whose  articles  point¬ 
ed  the  way  toward  DoD  systems  manage¬ 
ment  of  large-scale  software  and  systems 
engineering  integration.  As  well,  articles  on 
software  project  management  were  one  of 
the  tools  used  to  kick  off  Northrop’s 
CMMI  Level  3  effort  and  organize  a  more 
integrated  project  management  approach 
to  B-2  software  engineering. 


Around  this  same  time,  the  editorial 
team  of  CROSSTALK  invited  Northrop 
personnel  to  participate  in  STSC  meetings 
and  presentations,  and  introduced  us  to 
senior  DoD  system  managers.  These  ses¬ 
sions  were  the  genesis  of  our  software 
engineering  and  systems  engineering  break¬ 
throughs.  And,  although  it  may  seem  like  a 
small  gesture,  supplying  us  with  the  pro¬ 
ceedings  from  these  gatherings  helped  edu¬ 
cate  and  motivate  our  teams  of  software 
engineers  and  supported  our  argument  that 
we  needed  new  hardware  and  better  com¬ 
puter-aided  software  engineering  tools. 

We  saved  more  than  100,000  man¬ 
hours  by  implementing  methodologies 
gleaned  from  CROSSTALK,  equating  to 
approximately  $8.3  million  per  year  (in 
1993  dollars)  or  $33  million  over  the  four- 
year  period  of  development.  This  is  quite  a 
savings  when  compared  to  our  estimated 
$121  million  annual  budget.  We’ve  all 
received  plentiful  kudos  for  our  achieve¬ 
ments,  but  we  would  like  to  pass  along 
some  of  that  gratitude  to  CROSSTALK’S 
authors  and  staff. 

— John  B.  Burger 
Northrop  Grumman  (retired) 
4940  Flora  Vista  LN 
Sacramento,  CA  95822 
<jjburger@aol.com>

®  CMMI  is  registered  in  the  U.S.  Patent  and  Trademark 
Office  by  Carnegie  Mellon  University. 
WebBee: A Platform for Secure Mobile
Coordination  and  Communication  in  Crisis  Scenarios 


Sugih  Jamin 

University  of  Michigan1 

Recently, disaster scenarios and terrorist attacks have made apparent some fundamental shortcomings in first responders’
conventional coordination infrastructures. For example, unsatisfactory device connectivity and security vulnerabilities made evident
by devices’ inherently mobile nature have the potential to seriously compromise first responders’ effectiveness. To address these
shortcomings, our team designed and built WebBee, a secure coordination and communication infrastructure. This article will
take a high-level look at WebBee’s architecture, and examine some interesting, non-trivial sample applications we have deployed
on top of it.


Ever  since  the  September  11,  2001  ter¬ 
rorist  attacks,  the  United  States  has 
been  re-evaluating  coordination  for  first 
responders  in  disaster  scenarios.  First 
responders  must  communicate  reliably 
and  securely  in  times  of  crisis.  However, 
communication  channels  such  as  cell 
phone  networks  may  be  impaired  or 
destroyed  during  disaster  scenarios.  Even 
if  communication  was  technically  feasible 
through  these  channels,  extreme  conges¬ 
tion  might  render  them  useless  for  first 
responders.  Another  problem  is  that  these 
channels  are  more  vulnerable  to  compro¬ 
mise:  A  malicious  agent  could  steal  a  first 
responder’s  cell  phone  and  intercept  com¬ 
munications.  This  can  seriously  under¬ 
mine  a  first  responder’s  effectiveness  in 
crisis  situations. 

First responders have three primary
needs. First, they must be able to communicate
using devices that they likely already have
and are accustomed to. Second,
the communication channel must be
secure in mobile environments. Finally,
although the consumer communication
infrastructure can sometimes be used in a
time of crisis, it cannot be relied upon
solely. WebBee addresses each of these
concerns.

Architecture 

There  are  three  major  components  of  the 
WebBee  architecture  (as  shown  in  Figure  1 
on  the  following  page):  the  instant  infra¬ 
structure,  the  WebBee  coordination  server,  and 
the  database  server.  The  system  has  been 
designed  so  that  components  can  be  dis¬ 
tributed  across  different  machines. 

Certain  field  personnel  are  equipped 
with  battery-operated  instant  infrastruc¬ 
ture  backpack  units.  Equipment  is  com¬ 
mercial  off-the-shelf  hardware,  so  very 
large  numbers  of  personnel  can  be  outfit¬ 
ted  easily.  Custom  SMesh  software  [1] 
helps  maximize  connectivity  by  dynami¬ 
cally  reorganizing  the  network  topology  as 
personnel  move  about  the  field.  The 
WebBee  coordination  server  is  an  abstrac¬ 


tion  of  several  components  that  coordi¬ 
nate  request  handling,  challenge-response 
management,  policy  examination,  applica¬ 
tion  hosting,  and  message  dispatching. 
The  database  server  manages  all  data 
interactions. 

WebBee Coordination Server Component Detail

WebBee Master Server and Challenge Server Interaction

The  WebBee  master  server  negotiates 
traffic  from  clients  between  the  challenge 
server  and  the  application  bridge.  When  a 

client  request  comes  in,  the  WebBee  mas¬ 
ter  server  stores  it  and  asks  the  challenge 
server  whether  the  client  needs  to  be 
challenged.  If  the  challenge  server  deter¬ 
mines  no  challenge  is  needed,  it  tells  the 
WebBee  master  server  that  it  is  OK  to 
proceed.  Otherwise,  the  challenge  server 
issues  a  challenge  through  the  master 
server  to  the  client.  The  client’s  solution 
is  sent  back  through  the  master  server  to 


the  challenge  server.  If  it  is  invalid,  the 
challenge  server  informs  the  master  serv¬ 
er  that  no  action  is  to  be  taken  and  the 
client  is  informed  that  the  request  was 
denied.  If  the  solution  is  valid,  the 
WebBee  master  server  retrieves  the 
client’s  most  recent  request  and  dispatch¬ 
es  it  to  the  application  bridge.  Our 
model,  therefore,  assumes  that  clients  will 
only  ever  need  a  single  request  serviced  at 
a  time. 
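
The request flow above can be sketched in a few lines. This is only an illustrative model, not the WebBee implementation: the class names, the `needs_challenge` hook, the string results, and the password table are all assumptions made for the example. It keeps one stored request per client, which mirrors the single-request assumption stated above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class ChallengeServer:
    """Decides whether a client must be challenged and checks solutions."""
    needs_challenge: Callable[[str], bool]  # hypothetical policy hook
    secrets: Dict[str, str]                 # hypothetical: client id -> expected answer

    def check(self, client_id: str, solution: str) -> bool:
        return self.secrets.get(client_id) == solution


@dataclass
class MasterServer:
    challenge_server: ChallengeServer
    bridge: Callable[[str, str], str]       # (client id, request) -> response
    pending: Dict[str, str] = field(default_factory=dict)

    def handle_request(self, client_id: str, request: str) -> str:
        # Store the request first; only the most recent one is kept per client.
        self.pending[client_id] = request
        if self.challenge_server.needs_challenge(client_id):
            return "CHALLENGE"
        return self.bridge(client_id, self.pending.pop(client_id))

    def handle_solution(self, client_id: str, solution: str) -> str:
        # An invalid solution leaves the stored request untouched and is denied.
        if not self.challenge_server.check(client_id, solution):
            return "DENIED"
        request = self.pending.pop(client_id, None)
        if request is None:
            return "DENIED"
        return self.bridge(client_id, request)


# Example wiring: only "carl" is challenged in this toy setup.
server = MasterServer(
    ChallengeServer(needs_challenge=lambda cid: cid == "carl",
                    secrets={"carl": "s3cret"}),
    bridge=lambda cid, req: f"served {req} for {cid}",
)
```

In this sketch an unchallenged request goes straight to the bridge, while a challenged client's request waits in `pending` until a valid solution arrives.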

Security 

Our  security  mechanism  is  broken  into 
three  separate  subsystems:  the  challenge 
server,  upload  security,  and  download 
security.  All  are  wrapped  in  a  secure  sock¬ 
ets  layer. 

The  Challenge  Server 

The challenge server works with two
concepts: policies and challenges. Policies encode
conditions under which challenges are
required, and are arranged in a hierarchy:
If  an  agent  passes  one  policy,  there  may 
still  be  subsequent  policies  that  must  be 
evaluated.  The  policy  scheme  for  the 
WebBee  coordination  server  is  depicted 
in  Figure  2  (see  next  page). 

The  first  policy  here  is  an  application- 
level  test.  This  special  policy  grants  full 
access  to  certain  applications,  and 
demonstrates  that  WebBee  supports  both 
secure  and  non-secure  applications.  If 
the  application  must  be  challenged,  a  tem¬ 
poral  policy  is  activated  to  determine  if 
the  client’s  last  challenge-response  has 
expired.  If  it  has  expired,  the  client  is 
issued  a  challenge.  The  last  policy  is  a 
geospatial  policy:  If  the  user  has  strayed  far 
away  from  the  set  of  last  known  global 
positioning  system  (GPS)  coordinates, 
the  client  is  challenged. 

Policy  intervals  can  be  defined  on  a 
per-user  basis,  based  on  the  level  of  secu¬ 
rity  required  for  each  client.  At  most,  one 
challenge  will  occur  through  a  traversal  of 
this  policy  flowchart.  Once  the  client 
solves the challenge, his or her GPS coordinates
and a timestamp are stored in the
database.

Figure 2: Policy Flowchart for the WebBee Coordination Server

Figure 3: Amanda, Bob, and Carl Initially All Have Valid Keyshares
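
As a rough illustration of the policy hierarchy, the sketch below evaluates the three policies in order and returns as soon as one of them demands a challenge, so at most one challenge results from a traversal. The field names, the per-user thresholds, and the use of planar coordinates in meters (rather than real GPS latitude and longitude) are simplifying assumptions for the example.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[float, float]  # simplified planar position in meters


@dataclass
class ClientState:
    last_challenge_time: Optional[float]  # seconds; None if never challenged
    last_position: Optional[Point]        # position stored at the last challenge
    max_age_s: float                      # per-user temporal interval (assumed)
    max_drift_m: float                    # per-user geospatial interval (assumed)


def needs_challenge(app_is_open: bool, state: ClientState,
                    now: float, position: Point) -> bool:
    # Policy 1: application-level test; non-secure applications get full access.
    if app_is_open:
        return False
    # Policy 2: temporal; challenge if the last response is missing or expired.
    if (state.last_challenge_time is None
            or now - state.last_challenge_time > state.max_age_s):
        return True
    # Policy 3: geospatial; challenge if the client strayed too far.
    if state.last_position is None:
        return True
    return math.dist(position, state.last_position) > state.max_drift_m


def record_success(state: ClientState, now: float, position: Point) -> None:
    # After a solved challenge, store the position and a timestamp.
    state.last_challenge_time = now
    state.last_position = position
```

Because the thresholds live on the per-client state, different users can be given tighter or looser intervals without changing the evaluation order.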

When  the  policy  flowchart  deter¬ 
mines  that  a  challenge  is  required,  the 
server  randomly  selects  one  of  several 
possible  challenges  and  issues  it  to  the 
client.  If  the  client  solves  it,  then  the 
request  is  serviced.  Otherwise,  the  cur¬ 
rent  and  all  subsequent  requests  will  also 
be  denied  until  the  client  successfully 
solves  the  original  challenge.  This  elimi¬ 
nates  malicious  clients’  ability  to  game  the 
system  by  exploring  the  challenge  space. 
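
One way to picture this rule is a gate that remembers the outstanding challenge for each client and keeps reissuing that same challenge until it is solved. The sketch below is hypothetical: the challenge table, the class name, and the method names are invented for illustration.

```python
import random
from typing import Dict, Optional

# Hypothetical challenge table: challenge name -> expected answer.
CHALLENGES = {"pin": "1234", "passphrase": "open sesame"}


class ChallengeGate:
    def __init__(self, rng: Optional[random.Random] = None) -> None:
        self._rng = rng or random.Random()
        self._outstanding: Dict[str, str] = {}  # client id -> challenge name

    def issue(self, client_id: str) -> str:
        # Reissue the original challenge instead of drawing a new one, so a
        # client cannot sample the challenge space by failing repeatedly.
        if client_id not in self._outstanding:
            self._outstanding[client_id] = self._rng.choice(sorted(CHALLENGES))
        return self._outstanding[client_id]

    def is_locked(self, client_id: str) -> bool:
        # While a challenge is outstanding, requests should be denied.
        return client_id in self._outstanding

    def answer(self, client_id: str, response: str) -> bool:
        name = self._outstanding.get(client_id)
        if name is None or CHALLENGES[name] != response:
            return False
        del self._outstanding[client_id]
        return True
```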

Currently, only text-based (e.g., password)
challenges have been implemented.
With the right hardware, the challenge
system could be extended to issue other
kinds of challenges, such as biometric
challenges (e.g., fingerprint, voice, and/or
retinal scanning).

Upload  Security 

In  our  scalable  crisis  management  sys¬ 
tem,  we  are  assuming  that  there  are  many 
downloads  but  relatively  few  uploads. 
With  this  in  mind,  we  have  decomposed 
our  security  requirements  into  upload 
and  download  security  characteristics. 

For  upload  security,  if  a  handheld  is 
lost,  we  want  to  ensure  that  (1)  data  that 
has  already  been  posted  cannot  be  repudi¬ 
ated,  and  (2)  data  cannot  be  post-dated. 
Our  forward  secure  signatures  use  a  pri¬ 
vate  key  that  evolves  as  a  function  of  time; 
the  public  key,  however,  remains  the  same. 
This kind of forward-secure scheme was
proposed by Anderson [2] and implemented
by Bellare and Miner [3].
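
The key-evolution idea can be illustrated with a deliberately simplified symmetric analogue. This is not the cited signature scheme: it uses a hash chain and HMAC tags, so verification needs the initial key rather than a fixed public key. It only shows why a device that erases old keys cannot produce valid tags for earlier periods. All function names are assumptions for the example.

```python
import hashlib
import hmac


def evolve(key: bytes) -> bytes:
    """One-way update: the next period's key is a hash of the current one."""
    return hashlib.sha256(b"evolve" + key).digest()


def key_for_period(initial_key: bytes, period: int) -> bytes:
    """Derive the key for a period by evolving the initial key step by step."""
    key = initial_key
    for _ in range(period):
        key = evolve(key)
    return key


def tag(key: bytes, period: int, message: bytes) -> bytes:
    """Authenticate a message together with the period it was posted in."""
    data = period.to_bytes(4, "big") + message
    return hmac.new(key, data, hashlib.sha256).digest()


def verify(initial_key: bytes, period: int, message: bytes, mac: bytes) -> bool:
    """Verifier side (assumed to hold the initial key, e.g. the server)."""
    expected = tag(key_for_period(initial_key, period), period, message)
    return hmac.compare_digest(expected, mac)
```

Since `evolve` is one-way, someone who captures the key for period 3 can compute later keys but not the keys for periods 0 through 2, which is the property that protects already-posted data.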

Download  Security:  The  Quorum 
System 

For download security, scaling is an
important issue. When membership changes
(e.g., a departure or the loss of a device),
we want relatively few of a client’s staff
to have to acquire new keys. The
quorum system implements download
security with these kinds of scalability
concerns in mind.

In  the  quorum  system,  agents  need  to 
have  a  minimum  number,  k,  of  key  shares 
to  securely  read  a  message.  At  initializa¬ 
tion,  each  agent  receives  m  keyshares, 
where  m  >  k,  from  a  global  key  share  set 
consisting  of  a  total  of  s  keyshares.  If  a 
user  leaves,  his  or  her  shares  are  invali¬ 
dated  for  all  users.  When  a  user  has  fewer 
than  k  valid  shares,  they  must  obtain  a 
new  set  of  valid  keyshares  from  the  glob¬ 
al  keyshare  collection. 

When the server broadcasts a message,
it first encrypts it under a message
key. This key, in turn, is itself encrypted s
times. The s encrypted message keys and
the encrypted message are sent to all
agents, who decrypt the message keys
using their personal keysets. If exactly k
of the decrypted keys are identical, that key is
valid and the agent proceeds to decrypt the
encrypted message with it.

Figures 3, 4, and 5 depict a scenario in
which k = 3 and m = 5. In Figure 3,
Amanda,  Bob,  and  Carl  all  have  a  quorum 
of  valid  keyshares.  In  Figure  4,  when  Bob 
leaves,  three  of  Amanda’s  keyshares  are 
invalidated,  forcing  her  to  obtain  new 
shares.  Carl  only  has  two  shares  invalidat¬ 
ed;  he  can  continue  to  operate.  Figure  5 
depicts  the  scenario  in  which  Amanda 
has  reported  a  lost  or  stolen  handheld,  in 
which  case  all  of  Amanda’s  keyshares  are 
invalidated.  In  this  instance,  Carl  must 
reacquire  new  keyshares  to  operate. 
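
The bookkeeping behind this scenario can be sketched with plain sets. The share numbers below are hypothetical and are not taken from the figures; they were chosen only so that the outcome matches the narrative (k = 3, m = 5). The sketch also ignores any replacement shares Amanda would have obtained after Bob left.

```python
from typing import Set

K = 3  # minimum number of valid keyshares needed to read a message


def valid_shares(shares: Set[int], invalidated: Set[int]) -> Set[int]:
    """Shares an agent holds that have not been invalidated."""
    return shares - invalidated


def has_quorum(shares: Set[int], invalidated: Set[int], k: int = K) -> bool:
    return len(valid_shares(shares, invalidated)) >= k


# Hypothetical assignment of m = 5 shares per agent from a global set of 9.
agents = {
    "amanda": {1, 2, 3, 4, 5},
    "bob": {3, 4, 5, 6, 7},
    "carl": {1, 6, 7, 8, 9},
}

invalidated: Set[int] = set()

# Bob leaves: his shares are invalidated for every agent.
invalidated |= agents["bob"]
after_bob = {name: has_quorum(shares, invalidated)
             for name, shares in agents.items() if name != "bob"}

# Amanda reports a lost handheld: all of her shares are invalidated too.
invalidated |= agents["amanda"]
carl_after_amanda = has_quorum(agents["carl"], invalidated)
```

With this assignment, Amanda drops to two valid shares after Bob leaves and must obtain new ones, while Carl keeps three and continues; once Amanda's shares are invalidated as well, Carl falls below the quorum.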

Application  Bridge 

The  application  bridge  dispatches 
requests  to  the  appropriate  application 
daemon  via  an  ID  embedded  in  the 
request  header.  If  a  response  is  generat¬ 
ed,  it  is  sent  back  through  the  WebBee 
master  server  to  the  client.  Gas  Prices, 
Event  Reports,  and  Agent  Contingency 
and  Action  Coordinator  (AC2)  are  three 
applications  we  have  built  using  the 
WebBee  framework. 
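
A minimal sketch of ID-based dispatch is shown below. The request layout (a `header` dictionary with an `app_id` field) and the daemon signature are assumptions for illustration; the article does not describe the actual header format.

```python
from typing import Callable, Dict, Optional

# A daemon takes a request and may or may not produce a response.
Daemon = Callable[[dict], Optional[str]]


class ApplicationBridge:
    def __init__(self) -> None:
        self._daemons: Dict[str, Daemon] = {}

    def register(self, app_id: str, daemon: Daemon) -> None:
        self._daemons[app_id] = daemon

    def dispatch(self, request: dict) -> Optional[str]:
        app_id = request["header"]["app_id"]  # hypothetical header field
        daemon = self._daemons.get(app_id)
        if daemon is None:
            raise KeyError(f"no daemon registered for application {app_id!r}")
        # The response, if any, would travel back through the master server.
        return daemon(request)


bridge = ApplicationBridge()
bridge.register("gas_prices", lambda req: f"prices near {req['body']}")
bridge.register("event_reports", lambda req: None)  # logs only, no response
```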

Gas  Prices 

The Gas Prices application allows clients
to determine the gas stations with the
cheapest prices. A client initially sends a
request containing his or her GPS coordinates.
The Gas Prices daemon constructs
a map through an implementation
of the U.S. Census Bureau’s Topologically
Integrated Geographic Encoding and
Referencing (TIGER) geographic information
system (GIS) database [4], then
queries a Web site that publishes up-to-date
gas prices and sends the results back to the
client.

Gas  Prices  and  other  applications  use 
the  WebBee  scraping  engine  to  obtain 
data  from  the  Web.  For  each  application, 
a  scraping  script  identifies  the  data  compo¬ 
nents  of  interest  in  a  Web  page.  Any  sta¬ 
tic  or  dynamic  data  can  be  acquired — 
including  text,  images,  and  audio. 

Event  Reports 

The  Event  Reports  application  (see  Figure 
6,  next  page)  allows  clients  to  log  incidents 
that  they  observe  in  the  field.  Other 
clients  are  notified  about  these  incidents 
only  once  they  become  geospatially  rele¬ 
vant.  Clients  specify  details  about  an  inci¬ 
dent  by  typing  out  a  short  message — as 
well  as  a  radius  in  meters — on  the  hand¬ 
held  device.  As  other  clients  move  in 


range, their handhelds are notified via the
short messaging service (SMS). This
relieves clients of having to sift through
reports to determine which are immediately
important, enabling them to
react faster and more effectively.

A  scenario  is  shown  in  Figure  7  (see 
next  page).  A  report  about  a  fire  at  the 
Chicago  Mercantile  Exchange  (A)  is  sub¬ 
mitted.  One  fire  department  unit  (B)  and 
two  police  department  units  (C)  and  (F) 
receive  the  alert  about  the  fire.  Another 
report  about  an  unrelated  incident  is  sub¬ 
mitted  by  an  informant  across  the  city 


(D).  Here,  one  fire  department  unit  is 
alerted  (E),  as  is  one  police  department 
unit  (F).  Notice  that  (F)  receives  alerts 
about  both  incidents  since  it  is  in  range  of 
both. By contrast, another police department
unit (G) receives no alerts. As soon as (G)
moves into range (if ever), it will
receive the report.
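
The range test behind such a scenario might look like the following sketch, which uses the haversine formula for great-circle distance and remembers which units have already been alerted. The coordinates in the usage note, the `Report` structure, and the function names are illustrative assumptions; in WebBee itself this matching is handled by the database.

```python
import math
from dataclasses import dataclass, field
from typing import List, Set, Tuple

EARTH_RADIUS_M = 6_371_000.0
LatLon = Tuple[float, float]  # (latitude, longitude) in degrees


def haversine_m(a: LatLon, b: LatLon) -> float:
    """Great-circle distance between two points, in meters."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


@dataclass
class Report:
    text: str
    center: LatLon
    radius_m: float
    notified: Set[str] = field(default_factory=set)


def on_position_update(unit: str, position: LatLon,
                       reports: List[Report]) -> List[str]:
    """Return the texts of reports that just became relevant to this unit."""
    alerts = []
    for report in reports:
        in_range = haversine_m(position, report.center) <= report.radius_m
        if in_range and unit not in report.notified:
            report.notified.add(unit)  # alert each unit only once per report
            alerts.append(report.text)
    return alerts
```

For example, with a single report `Report("fire", (41.8781, -87.6298), 1000.0)`, a unit a few hundred meters away is alerted on its next position update, while a unit several kilometers away hears nothing until it moves inside the radius.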

Figure 4: Bob Leaves

Figure 6: Mobile Client Screenshots for the Event Reports System

Event Reports - Exploiting Database
Triggers for Better Performance

Report notifications to clients are implemented
through database triggers. The
WebBee database server contains an
information server, which is a Postgres
database  with  a  PostGIS  [5]  extension 
that  is  integrated  with  an  instance  of  a 
visualization  server  in  an  application  dae¬ 
mon.  The  visualization  server  renders 
map  data  for  visualization  [6]  in  concert 
with  an  instance  of  a  TIGER  database 
[4].  When  a  client  enters  an  event  report 
region,  the  database  triggers  the  insertion 
of  a  new  record  into  a  special  table. 
Meanwhile,  the  event  reports  daemon 
monitors  this  table.  If  there  are  any  new 


entries,  the  daemon  creates  an  SMS  and 
sends  it  to  the  target  user.  The  heavy  lift¬ 
ing  for  this  mechanism  is  done  through 
an  extension  of  Postgres  triggers 
(Figures  8  and  9  show  an  example  for 
alpha-numeric  and  spatial  range  triggers), 
resulting  in  fewer  queries  and  better  per¬ 
formance. 

Trigger support in Postgres is table-based
and comparatively primitive: with n
table triggers, an update will cause n operations
to occur, resulting in decreased
performance if updates are frequent.
Also, Postgres does not provide out-of-the-box
support for multi-table triggers.
This becomes a problem, for example,
with mixed notifications.

Figure 7: An Example Event Reports Scenario

To address these problems, we have
implemented a trigger meta table, which
encodes relationships between trigger
class identifiers and ownership, and is referenced
before trigger evaluations. Consider
the mixed notification: “NOTIFY
me WHEN I come WITHIN 2 miles of a
gas station WITH a gas price LOWER
THAN $3.50.” When the user’s location is
updated, the trigger meta table is examined
on the user ID and the trigger class identifier.
When gas prices are updated, entries in
the meta table are examined on the gas
station ID and the trigger class identifier.
Performance  is  up  to  eight  times  faster 
than  without  the  meta  table  for  alpha¬ 
numeric  triggers  (Figure  8),  and  up  to  10 
times  faster  for  spatial  range  triggers 
(Figure  9).  Performance  increases  as  the 
total  number  of  triggers  increases. 
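
The intuition behind the meta table can be shown with an in-memory index keyed on (trigger class, owner or entity ID): an update only evaluates the triggers registered under its own key instead of every trigger on the table. This is a conceptual sketch in Python, not the Postgres extension itself, and the class names, key layout, and row fields are assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List, Tuple


@dataclass
class Trigger:
    owner: str
    condition: Callable[[dict], bool]


class TriggerMetaTable:
    def __init__(self) -> None:
        # (trigger class identifier, user or entity ID) -> registered triggers
        self._index: DefaultDict[Tuple[str, str], List[Trigger]] = defaultdict(list)
        self.evaluations = 0  # counts condition checks, for comparison

    def register(self, trigger_class: str, key: str, trigger: Trigger) -> None:
        self._index[(trigger_class, key)].append(trigger)

    def on_update(self, trigger_class: str, key: str, row: dict) -> List[str]:
        """Evaluate only the triggers filed under this class and key."""
        fired = []
        for trigger in self._index.get((trigger_class, key), []):
            self.evaluations += 1
            if trigger.condition(row):
                fired.append(trigger.owner)
        return fired


meta = TriggerMetaTable()
cheap_nearby = Trigger(
    "amanda", lambda row: row["distance_miles"] <= 2 and row["price"] < 3.50)
# The mixed notification is filed under both tables it depends on.
meta.register("location", "amanda", cheap_nearby)
meta.register("gas_price", "station-7", cheap_nearby)
# Unrelated triggers that a naive scheme would also evaluate on every update.
for i in range(100):
    meta.register("gas_price", f"other-{i}", Trigger(f"user-{i}", lambda row: False))
```

Here a price update for `station-7` costs a single condition check even though 101 price triggers are registered, which is the kind of saving that grows with the number of triggers.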

Agent  Contingency  and  Action 
Coordinator 

Another  application  that  we  have  built  is 
an  AC2  application,  which  provides  a  full- 
text,  voice,  and  picture  messaging  system. 
Messages  may  be  sent  directly  to  individ¬ 
ual  clients  or  by  radius.  The  radius  message 
mechanism  works  as  follows:  The  sender 
specifies  his  or  her  GPS  coordinates  and 
radius  in  meters  within  the  message  head¬ 
er.  When  the  message  is  sent  to  the  serv¬ 
er,  all  agents’  last  known  GPS  coordinates 
are  examined.  The  message  is  sent  to  all 
agents in the defined circle. Radius messaging
might be useful, for example, for
the dissemination of orders to all agents
within a specific location.

Figure 8: Meta Table Performance Comparison for Alpha-Numeric Triggers

Another  innovative  feature  of  AC2  is 
message  withdrawal.  If  a  client  has  sent  a 
message  and  then  later  circumstances 
change  and  they  no  longer  want  the  mes¬ 
sage  to  be  read  by  other  agents,  they  can 
withdraw  the  message;  it  will  be  removed 
from  the  inboxes  of  all  agents  to  whom 
the  user  sent  it.  This  is  useful  in  situations 
in  which  agents  have  decided  a  reported 
incident  has  stopped  being  of  interest. 
For  example,  if  an  agent  initially  reports 
seeing  a  suspicious  package,  but  later 
determines  that  it  is  not  a  threat,  they  can 
withdraw  the  message  to  prevent  confu¬ 
sion  among  the  other  agents.  All  mes¬ 
sages — including  withdrawn  messages — 
persist  in  the  WebBee  server  log  so  as  to 
provide  a  traceable  audit  trail. 
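
Message withdrawal with a persistent audit trail can be modeled roughly as follows. The data layout, method names, and log format are assumptions for illustration; the point is only that withdrawal removes inbox copies while the append-only log keeps both the original message and the withdrawal.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MessageServer:
    inboxes: Dict[str, Dict[int, str]] = field(default_factory=dict)
    # message id -> (sender, recipients)
    sent: Dict[int, Tuple[str, List[str]]] = field(default_factory=dict)
    # Append-only audit trail: (event, message id, text)
    log: List[Tuple[str, int, str]] = field(default_factory=list)
    _next_id: int = 1

    def send(self, sender: str, recipients: List[str], text: str) -> int:
        msg_id = self._next_id
        self._next_id += 1
        for recipient in recipients:
            self.inboxes.setdefault(recipient, {})[msg_id] = text
        self.sent[msg_id] = (sender, list(recipients))
        self.log.append(("SEND", msg_id, text))
        return msg_id

    def withdraw(self, sender: str, msg_id: int) -> bool:
        owner, recipients = self.sent.get(msg_id, (None, []))
        if owner != sender:
            return False  # only the original sender may withdraw
        for recipient in recipients:
            self.inboxes[recipient].pop(msg_id, None)
        self.log.append(("WITHDRAW", msg_id, ""))
        return True
```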

Conclusion 

WebBee  is  a  robust,  mobile,  scalable  com¬ 
munications  and  coordination  framework 
that  can  handle  several  applications  at  var¬ 
ious  levels  of  security.  The  challenge- 
response  and  quorum  systems  are  scalable 
mobile  security  paradigms  that  are  appro¬ 
priate  for  our  system.  The  implementation 
of  a  policy  hierarchy  strikes  a  nice  balance 
between  client  situation-dependent  securi¬ 
ty  and  future  extensibility.  Finally,  database 
optimizations — like  trigger  meta  tables 
and  streamlined  indexing — impart  signifi¬ 
cant  performance  gains  to  our  system. ♦ 
Constructing Change-Tolerant Systems
Using  Capability-Based  Design 
Large-scale, complex emergent systems demand extended development life cycles. Unfortunately, the inescapable introduction
of change over that period of time often has a detrimental impact on quality, and tends to increase associated development
costs. In this article, we describe a capability-based approach to evolving change-tolerant systems; that is, systems whose
entities (or capabilities) are highly cohesive, minimally coupled, and exhibit balanced levels of abstraction.


The  widespread  advancements  in  tech¬ 
nology  have  encouraged  the  demand 
for  large-scale  problem  solving.  This  has 
resulted  in  substantial  investments  of 
time,  money,  and  other  resources  for  com¬ 
plex  engineering  projects  such  as  hybrid 
communication  systems,  state-of-the-art 
defense  systems,  and  technologically 
advanced  aeronautics  systems.  Unfortu¬ 
nately,  the  expenditures  are  belied  by  the 
failure  of  such  systems.  Plagued  by  evolv¬ 
ing  needs,  volatile  requirements,  market 
vagaries,  technology  obsolescence,  and 
other  factors  of  change,  a  large  number  of 
projects  are  prematurely  abandoned  or  are 
catastrophic  failures  [1,  2,  3].  The  inherent 
complexity  of  these  systems,  compounded 
by  their  lengthy  development  cycles,  is  fur¬ 
ther  exacerbated  by  utilizing  development 
methods  that  are  hostile  to  change. 
Moreover,  this  complexity  often  results  in 
emergent  behavior  [4]  that  is  unexpected. 
For  example,  the  introduction  of  a  new 
functionality  in  the  system  can  result  in 
unanticipated  interactions  with  other  exist¬ 
ing  components  that  can  be  detrimental  to 
the  overall  system  functionality. 

More recently, techniques such as
performance-based specifications (PBSs)
[5, 6] and capability-based acquisition
(CBA) [7] have been used to mitigate
change in large-scale systems. PBSs are
requirements  describing  the  outcome 
expected  of  a  system  from  a  high-level 
perspective.  The  less  detailed  nature  of 
these  specifications  provides  latitude  for 
incorporating  appropriate  design  tech¬ 
niques  and  new  technologies.  Similarly, 
CBA  is  expected  to  accommodate  change 
and  produce  systems  with  relevant  capa¬ 
bility  and  current  technology.  It  does  so  by 
both  delaying  requirement  specifications 
in  the  software  development  cycle  and 
allowing  time  for  a  promising  technology 
to  mature  so  that  it  can  be  integrated  into 
the  software  system.  However,  the  PBS 
and  CBA  approaches  lack  a  scientific  pro¬ 
cedure  for  deriving  system  specifications 
from  an  initial  set  of  user  needs.  More¬ 


over,  they  neglect  to  define  the  level  of 
abstraction  at  which  a  specification  or  a 
capability  is  to  be  described.  Thus,  these 
approaches  propose  solutions  that  are  not 
definitive,  comprehensive,  or  mature 
enough  to  accommodate  change  or  bene¬ 
fit  the  development  process  for  complex 
emergent  systems. 

In  order  to  function  acceptably  over 
time,  complex  emergent  systems  must 
accommodate  the  effect  of  dynamic  fac¬ 
tors — such  as  varying  stakeholder  expec- 


tations,  changing  user  needs,  advancing 
technology,  scheduling  constraints,  and 
market  demands — during  their  lengthy 
development  periods.  We  conjecture  that 
these  changes  can  be  achieved  with  mini¬ 
mum  impact  if  systems  are  architected 
using  aggregates  that  are  embedded  with 
change-tolerant  characteristics.  Such  ag¬ 
gregates  are  defined  as  capabilities. 

Capabilities  are  functional  abstractions 
that  populate  the  space  between  needs  and 
requirements.  As  such,  they  (a)  are  more 
rigorously  defined  than  user  needs,  (b) 
retain  crucial  context  information  inher¬ 
ent  to  the  problem  space,  but  at  the  same 
time  (c)  avoid  solution  specification  com¬ 
mitments  ascribed  to  requirements. 
Capabilities  are  constructed  so  that  they 
exhibit  high  cohesion,  low  coupling,  and 


balanced  abstraction  levels.  The  property 
of  high  cohesion  helps  localize  the  impact 
of  change  to  within  a  capability.  Also,  the 
ripple  effect  of  change  is  less  likely  to 
propagate  beyond  the  affected  capability 
because  of  its  reduced  coupling  with 
neighboring  capabilities  [8].  The  balanced 
level  of  abstraction  assists  in  understand¬ 
ing  the  embedded  functionality  in  terms 
of  its  most  relevant  details  [9].  Addition¬ 
ally,  we  observe  that  the  abstraction  level 
is  related  to  the  size  of  a  capability;  the 
higher  the  abstraction  level,  the  greater  the 
size  of  a  capability  [10].  From  a  software 
engineering  perspective,  abstractions  with 
a  smaller  size  are  more  desirable  for  imple¬ 
mentation. 

Capabilities  are  generated  using  a 
capabilities  engineering  (CE)  process. 
Specifically,  this  approach  employs  a 
unique  algorithm  and  a  set  of  well- 
defined  metric  computations  that  exploit 
the  principles  of  decomposition,  abstrac¬ 
tion,  and  modularity  to  identify  functional 
aggregates  (i.e.,  capabilities).  Such  capabil¬ 
ities  embody  the  desirable  software  engi¬ 
neering  attributes  of  high  cohesion,  low 
coupling  and  balanced  abstraction  levels. 
Change-tolerance  is  achieved  through  the 
embodiment  of  such  attributes.  The  inte¬ 
gration  of  the  CE  process  with  existing 
development  paradigms,  and  the  exploita¬ 
tion  of  enhanced  traceability  that  accom¬ 
panies  it,  are  expected  to  reveal  more 
effective  methods  for  designing,  building, 
and  maintaining  software  for  real-world 
systems.  This  results  in  a  capability-based 
system  specification  that  is  change-toler¬ 
ant,  permitting  a  just-in-time  specification 
of  requirements,  and  an  incremental 
development  cycle  that  can  span  long 
periods  of  time. 

The  CE  Process 

The problem of changing requirements,
especially in developing large complex systems,
is well established [11]. Software
development processes that are ill-equipped
to accommodate change are primarily
afflicted with requirements volatility
[12]. This phenomenon is known to
increase the defect density and affect project
performance, resulting in schedule and
cost overruns [2, 13]. Traditional requirements
engineering (RE) strives to manage
volatility by baselining requirements.
However, the dynamics of user needs and
technology advancements during the
extended development periods of complex
emergent systems discourage fixed
requirements.

Figure 1: The CE Process

Our  approach,  the  CE  process,  builds 
change-tolerant  systems  on  the  basis  of 
optimal  sets  of  capabilities.  Figure  1  illus¬ 
trates  the  two  major  phases  of  the  CE 
process.  Phase  I  identifies  sets  of  capabil¬ 
ities  based  on  the  values  of  cohesion,  cou¬ 
pling,  and  abstraction  levels.  Phase  II,  a 
part  of  our  ongoing  research,  further  opti¬ 
mizes  these  initial  sets  of  capabilities  to 
accommodate  schedule  constraints  and 
technology  advancements.  The  CE  pro¬ 
cess  is  discussed  further  in  the  following 
section. 

The  capabilities  identification  algo¬ 
rithm  (also  described  in  the  following  sec¬ 
tion)  employs  measures  of  cohesion,  cou¬ 
pling,  and  abstraction  to  identify  candidate 
sets  of  capabilities  that  necessarily  and 
sufficiently  embody  the  desired  system 
functionality. Once identified, they can be
further optimized to suit schedule and/or
technology constraints. Because capabilities
are formulated from user needs,
our efforts focus on needs analysis,
a phase prior to requirements specification.
At this point, we consider only the
functional aspects of a system.

Computing  Capabilities:  The 
Algorithm 

Capabilities  are  determined  mathematical¬ 
ly  from  a  function  decomposition  (FD) 
graph (see Figure 2). This directed acyclic graph, implicitly derived from
user needs, represents system functionality at various levels of
abstraction.
The  highest  abstraction  level,  represented 
by  the  root  node,  connotes  the  mission  of 
the  system;  the  lowest  levels  of  abstrac¬ 
tion  (i.e.,  the  leaves),  represent  directives. 
Directives  are  low-level  characteristics  of 
the  system  formulated  in  the  language  of 
the  problem  domain.  They  differ  from 
requirements  in  that  requirements  are  for¬ 
mulated  using  language  and  terminology 
inherent  to  the  more  technically  oriented 
solution  domain.  Thus,  capabilities  are 
identified  after  the  elicitation  of  needs  but 
prior  to  the  formalization  of  technical  sys¬ 
tem  requirements.  This  unique  spatial 
positioning  permits  the  definition  of 


capabilities  to  be  independent  of  any  par¬ 
ticular  development  paradigm.  We  envi¬ 
sion  that  by  doing  so,  capabilities  can 
bridge  the  chasm  between  the  problem 
and  the  solution  space,  also  described  as 
the  complexity  gap  [14].  It  is  recognized  that 
this  gap  is  responsible  for  information 
loss,  misconstrued  needs,  and  other  detri¬ 
mental  effects  that  plague  system  develop¬ 
ment  [15,  16]. 

To  identify  capabilities,  we  need  to 
examine  all  possible  functional  abstrac¬ 
tions  of  a  system  represented  in  the  FD 
graph.  Intuitively,  the  algorithm  for  com¬ 
puting  the  desired  set  of  capabilities  is  a 
five-step  process  that  produces  slices 
through  the  FD  graph.  We  define  a  slice  to 
be  any  subset  of  interior  nodes  of  the  FD 
graph  such  that  their  respective  frontiers 
uniquely  cover  all  directives.  We  select  the 
slice  containing  the  set  of  interior  nodes 
that  are  maximally  cohesive,  minimally 
coupled,  and  exhibit  balanced  levels  of 
abstraction.  In  effect,  this  slice  contains 
the  desired  set  of  capabilities. 

The  following  sub-sections  outline  the 
process  for  identifying  the  slice  containing 
the  desired  set  of  capabilities. 


Step 1: Constructing the Functional
Decomposition  Graph 

An  FD  graph  represents  functional 
abstractions  of  the  system  obtained  by  the 
systematic  decomposition  of  user  needs. 
A  need  at  the  highest  level  of  abstraction 
is  the  mission  of  the  system  and  is  repre¬ 
sented  by  the  root.  We  use  the  top-down 
philosophy  to  decompose  the  mission  into 
functions  at  various  levels  of  abstraction. 
We  claim  that  a  decomposition  of  needs  is 
equivalent  to  a  decomposition  of  func¬ 
tions  because  a  need  essentially  represents 
some  functionality  of  the  system. 
Formally, we define an FD graph G = (V, E) as an acyclic directed graph
where V is the vertex set and E is the edge set.

V  represents  the  system’s  functionality: 
Leaves  represent  directives,  the  root  sym¬ 
bolizes  the  mission,  and  internal  nodes 
indicate  system  functions  at  various 
abstraction levels. Similarly, the edge set E comprises edges that depict
decomposition, intersection, or refinement relationships among nodes.
These edges are illustrated in Figure 2. An edge between a parent and its
child nodes represents functional decomposition and implies that the
functionality of the child is a proper subset of the parent's
functionality. Only internal (non-leaf) nodes with an outdegree of at
least 2 can have valid decomposition edges with their children. The
refinement edge is used when there is a need to express a node's
functionality with more clarity, say, by furnishing additional details. A
node with an outdegree of 1 symbolizes this type of relationship with its
child node. To indicate the commonalities between functions defined at the
same level of abstraction, the intersection edge is used. Hence, a child
node with an indegree greater than 1 represents a functionality common to
all its parent nodes. The FD graph utilizes these definitions to provide a
structured top-down representation of system functionality, thereby
facilitating the formulation of capabilities in terms of their cohesion,
coupling, and abstraction values. We discuss those computational mechanics
next.

Figure 2: Example FD Graph G = (V, E)

Table 1: Relevance Values

Impact of Directive    Description of Impact on     Relevance of
Omission               Associated Parent Task       Directive
Catastrophic           Task failure                 1.00
Critical               Task success questionable    0.70
Marginal               Reduction in performance     0.30
Negligible             Non-operational impact       0.10
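
The structural rules of Step 1 can be sketched in code. The following Python fragment is a minimal illustration, not the authors' implementation: the class name FDGraph, its methods, and the small example graph are assumptions made for this sketch, and the example is not the graph of Figure 2. An edge is labeled decomposition when the parent has an outdegree of at least 2, refinement when the outdegree is 1, and additionally intersection when the child has an indegree greater than 1.

```python
class FDGraph:
    """Minimal directed acyclic function decomposition graph (illustrative)."""

    def __init__(self):
        self.children = {}  # parent -> list of child nodes
        self.parents = {}   # child  -> list of parent nodes

    def add_edge(self, parent, child):
        self.children.setdefault(parent, []).append(child)
        self.parents.setdefault(child, []).append(parent)

    def is_directive(self, node):
        # Leaves (nodes without children) are directives.
        return not self.children.get(node)

    def edge_kinds(self, parent, child):
        """Classify the edge parent -> child by the degree rules in the text."""
        outdegree = len(self.children.get(parent, []))
        indegree = len(self.parents.get(child, []))
        kinds = {"decomposition" if outdegree >= 2 else "refinement"}
        if indegree > 1:
            kinds.add("intersection")
        return kinds


# Hypothetical example: mission M, functions f1..f4, directives d1..d4.
g = FDGraph()
for parent, child in [("M", "f1"), ("M", "f2"), ("f1", "f3"), ("f1", "d1"),
                      ("f2", "f3"), ("f2", "f4"), ("f3", "d2"), ("f3", "d3"),
                      ("f4", "d4")]:
    g.add_edge(parent, child)

print(g.edge_kinds("M", "f1"))   # a decomposition edge
print(g.edge_kinds("f1", "f3"))  # decomposition plus intersection (f3 is shared)
print(g.edge_kinds("f4", "d4"))  # a refinement edge
```

Storing both the child and the parent lists keeps the outdegree and indegree checks constant-time, which is convenient because every later step queries them repeatedly.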

Step  2:  Computing  Cohesion  and 
Coupling  Values 

Only  interior  nodes  are  considered  as 
capability  candidates.  For  each  interior 


node,  its  cohesion  value  is  directly  propor¬ 
tional  to  how  important  its  children  nodes 
are  to  achieving  its  defined  functionality. 
Coupling,  on  the  other  hand,  is  a  pair-wise 
relationship  between  two  interior  nodes 
and  reflects  the  probability  that  a  change 
in  one  node  will  have  an  impact  on  the 
other. The cohesion value for each node and the coupling value for each
pair of nodes are computed using the FD graph G; these measures are
described next.

•  Cohesion.  The  cohesion  of  a  node  is 
computed  as  an  average  of  the  rele¬ 
vance  values  of  the  participating  direc¬ 
tives.  The  relevance  values  are  assigned 
based  on  the  values  listed  in  Table  1. 
However,  we  make  a  distinction 
between  the  parent  and  ancestor  nodes 
of  a  directive.  In  order  to  reduce  the 
need  for  user  input,  we  elicit  the  rele¬ 
vance  value  of  a  directive  only  with 
respect  to  its  parent  node.  Figure  2 
illustrates  relevance  values  of  direc¬ 
tives  to  their  parents. 

Assuming  that  each  directive  can 
be  associated  with  a  unique  parent 
node  (the  validity  of  this  assumption  is 
established  in  [17]),  then  the  cohesion 
for  any  node  n  can  be  computed  (as 
shown  in  Figure  3). 

•  Coupling.  To  measure  coupling,  we 
need  information  about  dependencies 
between  system  functionalities.  By 
virtue  of  its  construction,  the  struc¬ 
ture  of  the  FD  graph  represents  the 
relations  between  different  aggregates. 
In  particular,  we  compute  coupling 
between  two  nodes  in  terms  of  their 
directives.  Two  directives  are  said  to  be 
coupled  if  a  change  in  one  affects  the 
other.  We  compute  this  effect  as  the 
probability  that  such  a  change  occurs 
and  propagates  along  the  shortest  path 
between  them.  Note  that  the  coupling 
measure  is  asymmetric. 

For  two  nodes  p  and  q,  the  cou¬ 
pling  between  them  is  computed  (as 
shown  in  Figure  4). 
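
As a rough illustration of the two measures, the sketch below computes them on a small hypothetical tree. It follows the definitions given in Figures 3 and 4, under assumptions that are ours rather than the article's: the graph is a tree (every node has a unique parent), path length is counted in edges while ignoring edge direction, a directive child is treated as a node of size 1 whose cohesion is its relevance value, and all names (CHILDREN, RELEVANCE, cohesion, node_coupling) are invented for the example. The relevance values are taken from the scale in Table 1.

```python
from collections import deque

# Hypothetical toy FD tree: parent -> children. d1..d5 are directives.
CHILDREN = {"M": ["a", "b"], "a": ["d1", "d2"], "b": ["d3", "d4", "d5"]}
# Relevance of each directive to its parent (values from Table 1's scale).
RELEVANCE = {"d1": 1.00, "d2": 0.70, "d3": 0.30, "d4": 1.00, "d5": 0.10}
PARENT = {child: p for p, kids in CHILDREN.items() for child in kids}


def directives(node):
    """All directives (leaves) in the subtree rooted at node."""
    kids = CHILDREN.get(node, [])
    if not kids:
        return [node]
    return [d for k in kids for d in directives(k)]


def cohesion(node):
    """Size-weighted mean of the children's cohesion (Figure 3)."""
    kids = CHILDREN.get(node, [])
    if not kids:  # assumption: a directive counts as size 1
        return RELEVANCE[node]
    sizes = [len(directives(k)) for k in kids]
    return sum(s * cohesion(k) for s, k in zip(sizes, kids)) / sum(sizes)


def path_length(u, v):
    """Number of edges on the shortest path between u and v (undirected)."""
    adjacent = {}
    for p, kids in CHILDREN.items():
        for c in kids:
            adjacent.setdefault(p, set()).add(c)
            adjacent.setdefault(c, set()).add(p)
    seen, queue = {u}, deque([(u, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == v:
            return dist
        for nxt in adjacent[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    raise ValueError("nodes are not connected")


def prob_change(j):
    """PrbChg(j): 1 / number of directives under the parent of j."""
    return 1 / len(directives(PARENT[j]))


def node_coupling(p, q):
    """Coupling between disjoint nodes p and q (Figure 4); asymmetric."""
    dp, dq = directives(p), directives(q)
    total = sum(prob_change(j) / path_length(i, j) for i in dp for j in dq)
    return total / (len(dp) * len(dq))


print(f"{cohesion('a'):.3f}")            # 0.850
print(f"{cohesion('M'):.3f}")            # 0.620
print(f"{node_coupling('a', 'b'):.3f}")  # 0.083
print(f"{node_coupling('b', 'a'):.3f}")  # 0.125
```

The last two lines show the asymmetry noted above: the two nodes hold different numbers of directives, so the probability of change differs depending on which node is the target.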

Step  3:  Identifying  the  Candidate  Set 
of  Slices 

Recall  that  slices  are  sets  of  nodes  that 
necessarily  and  sufficiently  cover  all  direc¬ 
tives  identified  in  the  FD  graph. 
Moreover,  no  slice  contains  the  mission 
node  (M).  We  compute  all  possible  slices 
and  then  rank  them.  The  first  ranking 
(high  to  low)  is  based  on  the  average  cohe¬ 
sion  of  each  slice’s  constituent  nodes;  the 
second ranking (low to high) is based on the average coupling of each
slice's constituent nodes. The top 10 slices (an arbitrary count) common to
both sets are then chosen to form the pruned candidate set and represent
those slices that have the highest average cohesion and lowest average
coupling.

Figure 3: Equation for Computing Node Cohesion

(a) If node n has only directives as its children, then its cohesion is the
arithmetic mean of the relevance values of the associated directives, i.e.:

$$
\mathrm{Coh}(n) =
\frac{\displaystyle\sum_{\text{each directive } i \text{ associated with node } n}
      (\text{relevance value for directive } i)}
     {\text{total \# of directives associated with node } n}
$$

(b) For all other nodes:

$$
\mathrm{Coh}(n) =
\frac{\displaystyle\sum_{\text{each immediate child } i \text{ associated with node } n}
      (\text{\# of directives associated with child } i) \times (\text{cohesion of child } i)}
     {\displaystyle\sum_{\text{each immediate child } i \text{ associated with node } n}
      (\text{\# of directives associated with child } i)}
$$

Figure 4: Equation for Computing Coupling between Nodes

$$
\mathrm{Cp}_{\text{node}}(p, q) =
\frac{\displaystyle\sum_{\substack{\text{each pairwise directive } i \text{ and } j \\ \text{associated with nodes } p \text{ and } q\text{, respectively}}}
      (\text{coupling between directive } i \text{ and directive } j)}
     {(\text{total number of directives associated with node } p) \times
      (\text{total number of directives associated with node } q)}
$$

where the coupling between two directives i and j is computed as:

$$
\mathrm{Cp}_{\text{directive}}(i, j) =
\frac{\text{probability that directive } j \text{ will change}}
     {\text{length of shortest path connecting directive } i \text{ and directive } j}
$$

and where the probability that directive j will change is computed as:

$$
\mathrm{PrbChg}(j) =
\frac{1}{\text{total \# of directives associated with parent node of directive } j}
$$
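
Slice enumeration and pruning, as described in Step 3, can be sketched as follows. This is an illustrative reading rather than the authors' algorithm: it assumes the FD graph is a tree, interprets "common to both sets" as the intersection of the top-k entries of the two rankings, and uses hypothetical names (CHILDREN, cuts, slices, prune) and hypothetical scores. A node may be replaced by its children only when all of them are interior nodes, since slices contain no directives.

```python
from itertools import product

# Hypothetical toy FD tree: parent -> children. Nodes absent from the
# mapping (d1..d5) are directives.
CHILDREN = {
    "M": ["a", "b"],
    "a": ["a1", "a2"],
    "a1": ["d1", "d2"],
    "a2": ["d3"],
    "b": ["d4", "d5"],
}


def cuts(node):
    """All sets of interior nodes covering each directive under node once."""
    result = [frozenset([node])]
    kids = CHILDREN[node]
    if all(k in CHILDREN for k in kids):  # every child is interior
        for combo in product(*(cuts(k) for k in kids)):
            result.append(frozenset().union(*combo))
    return result


def slices(mission):
    """All slices of the graph; no slice contains the mission node."""
    return [s for s in cuts(mission) if s != frozenset([mission])]


def prune(candidates, avg_cohesion, avg_coupling, k=10):
    """Keep slices in the top k by cohesion and in the top k by coupling."""
    by_cohesion = sorted(candidates, key=avg_cohesion, reverse=True)[:k]
    by_coupling = sorted(candidates, key=avg_coupling)[:k]
    return [s for s in by_cohesion if s in by_coupling]


all_slices = slices("M")
print(len(all_slices))  # 2

# Hypothetical average scores per slice, for illustration only.
coh = {frozenset({"a", "b"}): 0.70, frozenset({"a1", "a2", "b"}): 0.60}
cpl = {frozenset({"a", "b"}): 0.10, frozenset({"a1", "a2", "b"}): 0.20}
best = prune(all_slices, coh.get, cpl.get, k=1)
print(sorted(best[0]))  # ['a', 'b']
```

Because the number of slices grows quickly with the size of the graph, exhaustive enumeration of this kind is practical only for graphs of moderate size.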

Step  4:  Computing  Balanced 
Abstraction  Levels 

In  the  next  step,  we  individually  examine 
each  of  the  10  slices  with  the  objective  of 
iteratively  decomposing  constituent  nodes 
to  achieve  a  balanced  level  of  implemen¬ 
tation  abstraction.  The  decomposition 
process  consists  of  replacing  a  parent 
node  with  its  children  nodes.  We  observe 
that  as  nodes  are  decomposed  the  abstrac¬ 
tion  level  becomes  lower — that  is,  the 
node  sizes  decrease  but  the  coupling  val¬ 
ues  increase  (size  is  the  number  of  direc¬ 
tives  associated  with  an  interior  node).  We 
strive  to  identify  nodes  of  reduced  sizes  in 
line  with  the  principles  of  modularization, 
but  only  if  the  increase  in  coupling  is 
acceptable.  There  are  two  possible  scenar¬ 
ios  when  attempting  to  lower  the  abstrac¬ 
tion  level  of  a  node:  The  replacement 
(children)  nodes  have  lower-level  common 
functionality,  or  they  have  no  common 
functionality. Referring again to the FD graph in Figure 2, suppose that
one of the candidate slices is {n1, n4, n5}.

•  Common Functionality. Assume that the size of n1 is too large; hence,
we attempt to lower its abstraction level by replacing it with its
children, n2 and n3, which are relatively smaller. We observe, however, that
these  nodes  share  a  common  function¬ 
ality  represented  by  n7.  This  implies 
that  one  of  the  links,  (n2,  n7)  or  (n3, 
n7),  needs  to  be  broken  in  order  to 
implement  n7  as  a  part  of  a  single-par¬ 
ent  capability.  Let  (n2,  n7)  be  broken, 
and  n7  be  implemented  as  a  part  of  n3. 
Consequently,  capabilities  n2  and  n3  are 
content-coupled  [16]  because  n2  may 
attempt  to  manipulate  the  n7  part 
embodied  in  n3.  Thus,  lowering  the 
abstraction  level  of  n1  results  in  capa¬ 
bilities  of  decreased  sizes,  but  with 
increased  coupling. 

•  No Common Functionality. Now we consider the reduction of n4 to
smaller-sized nodes, n9 and n10. Note that these replacement nodes share no
common functionality. We observe that there is a marginal
increase  in  coupling,  but  that  nodes  n9 
and  n10  are  of  smaller  sizes  when  com¬ 
pared  to  n4.  Thus,  we  choose  n9  and  n10 
over  their  parent  n4.  We  are  willing  to 
accommodate  this  negligible  increase  in 
coupling  for  the  convenience  of 
increased  modularity,  a  decision  based, 
in  part,  on  subjective  evaluation. 

Hence,  we  iteratively  compute  the 


appropriate  abstraction  level  for  each  node 
in  the  set  of  slices  identified  in  Step  3,  and 
perform  the  appropriate  decomposition 
substitutions.  Because  the  nodes  selected 
for  abstraction  balancing  are  in  the  set  of 
slices  resulting  from  Step  3,  they  also 
exhibit  high  cohesion  and  low  coupling. 
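
The substitution loop of Step 4 might be sketched as shown below. The article notes that the decision is partly subjective, so the explicit thresholds here (max_size and tolerance), the function name balance, and all sizes and coupling values are assumptions introduced only to make the idea concrete. The example mirrors the discussion above: decomposing n1 is rejected because its children share functionality and coupling rises sharply, while decomposing n4 is accepted.

```python
def balance(slice_nodes, children, size, coupling_of, max_size, tolerance):
    """Replace oversized nodes by their children while coupling stays acceptable."""
    current = set(slice_nodes)
    changed = True
    while changed:
        changed = False
        for node in sorted(current):
            kids = children.get(node, [])
            # Only oversized nodes whose children are all interior qualify.
            if size[node] <= max_size or not kids:
                continue
            if any(k not in children for k in kids):
                continue
            candidate = (current - {node}) | set(kids)
            if coupling_of(candidate) - coupling_of(current) <= tolerance:
                current, changed = candidate, True
                break  # restart the scan on the updated slice
    return current


# Hypothetical structure and sizes (number of directives per node).
children = {
    "n1": ["n2", "n3"], "n2": ["d1"], "n3": ["d2"],
    "n4": ["n9", "n10"], "n9": ["d3"], "n10": ["d4"],
    "n5": ["d5"],
}
size = {"n1": 6, "n2": 3, "n3": 3, "n4": 5, "n5": 2, "n9": 2, "n10": 3}


def coupling_of(nodes):
    """Hypothetical average coupling of a slice, for illustration only."""
    value = 0.10
    if {"n2", "n3"} <= nodes:  # shared functionality raises coupling sharply
        value += 0.30
    if "n9" in nodes:          # independent children raise it marginally
        value += 0.02
    return value


result = balance({"n1", "n4", "n5"}, children, size, coupling_of,
                 max_size=4, tolerance=0.05)
print(sorted(result))  # ['n1', 'n10', 'n5', 'n9']
```

In practice the coupling function would be the measure of Figure 4 recomputed on the candidate slice, and the thresholds would be set by the analyst.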

Step 5: Selecting the "Optimal" Set
of  Capabilities 

As  the  final  step,  we  re-compute  the  aver¬ 
age  coupling  and  cohesion  values  for  each 
of  the  10  slices.  The  slice  having  the  best  bal¬ 
ance  between  high  cohesion  and  low  coupling  is 
selected  as  the  set  of  capabilities  for  the  system. 

A  Validation  of  the  Current 
Work 

We empirically tested the hypothesis that a system design based on
capabilities is more change-tolerant than a design generated from the
traditional RE approach. More specifically, we examined the impact of
changing needs on the RE- and CE-based designs of a course evaluation
system [17].
The  original  high-level  design  of  this  sys¬ 
tem  is  based  on  an  RE  approach  and  is 
termed  RE-based  design.  The  CE-based 
design  was  constructed  using  a  capabilities 
approach  for  the  system.  To  determine  the 
optimal  capability  set,  we  constructed  an 
FD  graph  and  then  applied  the  algorithm 
described  earlier.  This  resulted  in  a  total  of 
1,495  slices,  from  which  the  slice  contain¬ 
ing  the  set  of  nodes  exhibiting  the  highest 
average  cohesion,  lowest  average  coupling, 
and  a  balanced  abstraction  level  was  select¬ 
ed  as  the  desired  capabilities  of  the  course 
evaluation  system.  The  CE-based  design 
was  constructed  based  on  the  chosen  capa¬ 
bility  set. 

The  RE-  and  CE-based  designs  were 
then  subjected  to  various  changes  in  needs. 
In particular, we examined the impact of six different changes in needs on
the course evaluation system. An example of a need
change  is,  “The  users  need  information 
about  the  handicapped-accessible  facilities 
for  courses  taught  in  Room  X.”  We  propa¬ 
gated  each  change  on  the  RE-  and  CE- 
based  designs  and  recorded  the  number  of 
affected classes. We then applied the Wilcoxon signed-rank test, the
non-parametric alternative to the paired t-test, which yielded a P-value
of 0.018. The P-value is the probability of observing a difference in the
number of affected classes at least as large as the one we measured if the
RE- and CE-based designs were in fact equally change-tolerant. This small
P-value leads us to reject the null hypothesis that the change-tolerance
of the system is indifferent to the choice between the RE and the CE
approach, in favor of the alternate hypothesis that the number of impacted
classes in the CE-based design is significantly less than that in the
RE-based design. This result is in agreement with our research claim that
the change-tolerance of a system improves with the use of a design based
on capabilities.
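
For readers unfamiliar with the test, the following sketch shows how an exact one-sided Wilcoxon signed-rank P-value can be computed for a small paired sample. The counts of affected classes below are invented purely for illustration; they are not the data of the study and are not intended to reproduce its P-value. The function name and its handling of ties (average ranks) and zero differences (dropped) are likewise assumptions of this sketch.

```python
from itertools import product


def wilcoxon_signed_rank_exact(x, y):
    """Return (W+, exact one-sided P-value) for paired samples x and y.

    The alternative hypothesis is that values in x tend to exceed those in y.
    Zero differences are dropped; tied magnitudes share their average rank.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    magnitudes = sorted(abs(d) for d in diffs)

    def rank(value):
        first = magnitudes.index(value) + 1
        last = first + magnitudes.count(value) - 1
        return (first + last) / 2

    ranks = [rank(abs(d)) for d in diffs]
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)

    # Under the null hypothesis each rank is equally likely to carry a
    # positive or a negative sign, so enumerate all 2**n sign patterns.
    n = len(ranks)
    as_extreme = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    return w_plus, as_extreme / 2 ** n


# Hypothetical numbers of affected classes for six changes in needs.
re_counts = [9, 7, 12, 5, 8, 10]
ce_counts = [4, 5, 6, 4, 4, 7]

w_plus, p_value = wilcoxon_signed_rank_exact(re_counts, ce_counts)
print(w_plus, p_value)  # 21.0 0.015625
```

With six pairs there are only 64 sign patterns, so full enumeration is cheap; for larger samples a statistics library with a normal approximation would be the usual choice.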

Summary 

The  current  and  proposed  work  addresses 
several  issues  associated  with  the  design, 
evolution,  and  emergent  behavior  of 
large-scale,  real-world  software  systems. 
As  stated  in  this  article,  CE  provides  a 
first-level  architectural  decomposition  of 
the  software  system.  Modularity  and  rea¬ 
soned  aggregation  are  cornerstones  for 
identifying  change-tolerant  functional 
units.  The  underlying  algorithm,  employ¬ 
ing  metric-based  computations,  extends 
needs  analysis  to  produce  sets  of  capabili¬ 
ties  enumerating  multiple  composition 
choices  and,  at  the  same  time,  indicates 
the  advantages/disadvantages  of  selecting 
one set over another. The use of capabilities also permits the delayed
commitment of needs to requirements, which, in turn, supports the
integration of new technology throughout the (extended) software
development effort. Moreover, because capabilities are designed to be
loosely coupled, they facilitate emergent behavior through the
addition/deletion of functionality as new operational conditions
and  constraints  evolve.  Finally,  we  expect 
capabilities  to  support  earlier  architectural 
analysis,  leading  to  the  design  of  systems 
that better accommodate non-functional
requirements like performance, security,
and reliability.

Department of Defense


The Department of Defense (DoD) Business Mission Area (BMA) accounts for
roughly half of the DoD Information Technology (IT) budget. Many of the
DoD's business systems have been in use for years and are straining to
support the agility of business operations necessary today. As well, many
new systems are being developed on such a scale that it takes nearly a
decade to produce the first results. A potential answer to this situation
is delivering business capabilities through a service-oriented
architecture (SOA)¹. Much of the private sector is rapidly moving in this
direction. The question is, will it work for the DoD? This article is
about the results of market research conducted by the BMA Chief Technical
Officer (CTO) and Chief Architect (CA) over a period of about six months
to learn about state-of-the-art SOA and what the DoD can count on from SOA
vendors to deliver both business services and SOA infrastructure in the
near- to mid-term.

The DoD is perhaps the largest and most complex organization in the world,
employing nearly 1.4 million people and holding approximately $1.4
trillion in assets. IT spending for business support activities in the DoD
BMA (funds to operate, maintain, and modernize business systems) comprises
$15.7 billion of annual DoD IT spending, an amount roughly equal to that
spent by the rest of the federal government.

While the DoD has long been acknowledged for its premier warfighting
capabilities, fragmentation of financial and business management practices
leaves the DoD vulnerable to waste, fraud, and abuse, as well as to the
risk of failure in attempts to build larger, more complex systems. To
support the DoD mission and the changing nature of the threats to which
the federal government must respond, the DoD BMA is engaged in a massive
business transformation. It must
modernize  and  become  agile  in  order  to 
support  21st  century  national  security 
requirements.  The  BMA  CTO  evaluated 
DoD  BMA  enterprise  processes  and  asso¬ 
ciated  systems — including  human  re¬ 
source  and  personnel  management,  sup¬ 
ply  chain,  and  logistics,  as  well  as  financial 
and  accounting  management  functions — 
to  determine  the  best  strategy  for  achiev¬ 
ing  agility.  After  analysis  and  assessment 
of  BMA  objectives  and  study  of  the  over¬ 
all  direction  for  IT  within  the  DoD,  the 
strategy  selected  to  move  the  BMA  for¬ 
ward  is  the  adoption  of  an  SOA. 

The  U.S.  Government  Accountability 
Office  describes  an  SOA  as  an  “approach 


for  sharing  functions  and  applications 
across  an  organization  by  designing  them 
as  discrete,  reusable,  business-oriented  ser¬ 
vices”  [1].  Most  importantly,  an  SOA  is  a 
mechanism  by  which  business  capabilities 
can  be  aligned  with  the  technical  infra¬ 
structure  in  support  of  an  agile  business 
strategy.  The  DoD’s  SOA  vision  calls  for 
such  alignment  through  an  architecture  of 
discrete  components  called  services  deliver¬ 
ing  business  capabilities  deployed  in  a  sup¬ 
portive  infrastructure  designed  for  this  pur¬ 
pose.  The  Office  of  the  BMA  CTO  and 
CA  collaborated  with  the  chief  informa¬ 
tion  officers  (CIOs)  representing  the  mili¬ 
tary  services,  defense  agencies,  combatant 
commands,  mission  areas,  and  DoD 
Enterprise  Services  to  develop  an  SOA 
strategy, including a supporting environment termed the Business Operating
Environment (BOE). The BOE leverages industry best practices to federate
technical architectures, develop capability requirements, and support the
delivery of portfolios of business capabilities based on collections of
atomic and/or composite service orchestrations. The BOE is defined in [2],
which details the infrastructure component of the BOE and the business
transformation infrastructure (BTI), shown in Figure 1. Some functions of
the BTI will be met through and built upon the DoD Global Information Grid
Enterprise Services. The technical core of the BTI, designated the
Business Transformation Engine (BTE), is to be built from commercially
available products.

Figure 1: The Business Transformation Infrastructure

To assess the feasibility of this strategy, the BMA CTO conducted market
research into the maturity and readiness of SOA technologies from more
than 30 organizations to support it. These organizations had passed a
preliminary screening to ensure that their offerings were realistic and
relevant. The technical research included all
components  of  the  BTE,  and  was  con¬ 
ducted  in  accordance  with  departmental 
regulations  guiding  pre-acquisition  market 
research.  The  organizations  provided  live 
demonstrations  of  their  development, 
test,  operational,  and  production  environ¬ 
ments.  CIO  offices  from  each  military  ser¬ 
vice  and  many  defense  agencies  were 
invited  to  attend  and  participate  in  the 
presentations.  This  article  provides 
research  conclusions  across  the  BTE  com¬ 
ponents  (numbered  in  Figure  1),  as  well  as 
SOA  information  assurance  and  gover¬ 
nance. 

Industry  Readiness  to  Support 
Key  BTI  Capabilities 

The research gathered data corresponding to the key technical capabilities
required to build the BTI and included an assessment of the industry's
maturity in providing tools to support these capabilities. The research
did not consider SOA technologies that are not relevant to the BTI. In
this section, we present the assessment.

Interoperability  Controller 

The  interoperability  controller  component 
of  the  BTI  is  a  pattern  or  foundation 
architecture  for  brokering,  routing,  and 
processing  messages  and  service  invoca¬ 
tions  within  an  SOA.  It  consists  of  an 
extensible  set  of  integration  brokers  inter¬ 
connected  on  the  network  by  robust  mes¬ 
saging  middleware.  The  research  looked  at 
products  supporting  this  pattern,  examin¬ 


ing  them  for  a  number  of  characteristics, 
including  support  for  indirection  and 
interception,  loose  coupling,  scalability, 
and  robustness.  In  general,  the  products 
that  most  closely  support  this  pattern  are 
enterprise  service  bus  (ESB)  products,  as 
well  as  enterprise  application  integration 
and  message-oriented  middleware  through 
composition. 

The market research shows that industry products are reasonably mature and
can support the implementation of the BMA vision for the BTI's
interoperability controller component.
The  message-oriented  middleware  and 
enterprise application integration product vendors have been working in
this direction through many generations of products. The ESB vendors have
built on this
experience  to  provide  an  enterprise-wide 
solution,  though  often  with  proprietary 
features.  The  challenge  with  the  latter  is  to 
build  a  standards-based  SOA  that  lever¬ 
ages  the  success  of  Web  technologies 
rather  than  an  ESB-based  solution  that 
provides  some  aspects  of  SOA  but  could 
lead  to  over-dependence  on  a  particular 
vendor’s  technology. 

Mediation  -  Standard  and  High 
Volume 

While more and more BMA systems and data sources will communicate natively
in terms of standard message sets and vocabularies, there is a short-term
need for mediation of information exchanges, that is, translating and
transforming messages between information providers and consumers. The
research found good support for this pattern in both the standard and
high-volume variations. Many vendors (such as Fiorano, BEA Systems, IBM,
IONA, TIBCO, and webMethods) are producing capable transformation engines,
especially those focused on Extensible Markup Language (XML) messaging and
the use of Extensible Stylesheet Language Transformations (XSLT) engines.
For high volume, Ab Initio, with
its  advanced  parallel  processing  capabili¬ 
ties,  allows  for  the  development  of  high- 
performance,  straight-through  mediation 
services.  However,  the  vision  for  dynam¬ 
ic  generation  of  transformations  on  a 
semantic  mediation  basis  was  found  to 
still  be  a  future  capability.  The  semantic 
technology  needed  is  immature,  so 
semantic  tools  from  companies  like 
Revelytix  and  IBM  are  best  suited  for 
supporting  development  time  activities, 
with  semantics  being  early-bound  into 
runtime  environments. 
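
To make the mediation pattern concrete, the sketch below translates a message from a provider's vocabulary into a consumer's vocabulary. It is only an illustration of the idea: the products surveyed typically use XSLT or proprietary engines, whereas this example substitutes a hand-written mapping built on Python's standard xml.etree.ElementTree module. The element names, the TAG_MAP table, and the mediate function are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from a provider's element names to a consumer's.
TAG_MAP = {"EmpId": "personnelId", "Nm": "fullName"}


def mediate(xml_text, root_tag="PersonnelRecord"):
    """Translate a provider message into the consumer's vocabulary.

    Elements without an entry in TAG_MAP are dropped from the output.
    """
    source = ET.fromstring(xml_text)
    target = ET.Element(root_tag)
    for child in source:
        if child.tag in TAG_MAP:
            ET.SubElement(target, TAG_MAP[child.tag]).text = child.text
    return ET.tostring(target, encoding="unicode")


message = "<Employee><EmpId>42</EmpId><Nm>Pat Lee</Nm><Unused>x</Unused></Employee>"
print(mediate(message))
# <PersonnelRecord><personnelId>42</personnelId><fullName>Pat Lee</fullName></PersonnelRecord>
```

The mapping table here is fixed at development time, which corresponds to the early-bound approach described above; generating such mappings dynamically from semantic descriptions is the capability the research found to be still in the future.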

Service  Discovery  and  Metadata 
Registries 

The  BMA’s  approach  to  SOA  calls  for 
metadata registries and repositories supporting the discovery of services
and information assets. DoD registries are built around Organization for
the Advancement of Structured Information Standards (OASIS) standards,
such as Universal Description, Discovery, and Integration (UDDI) and the
electronic business XML (ebXML) Registry Information Model and Registry
Services. They include a UDDI service registry, a Metadata Registry that
contains the DoD's structural and semantic metadata, and an enterprise
catalog containing DoD specification metadata to support discovery of
information assets. Given the DoD's size
and  the  likely  need  to  federate  registries, 
the  BMA  included  this  category  in  the 
market  research. 

Many of the vendors in the market research provide UDDI service
registries, notably Systinet, now a part of HP. Several vendors (e.g.,
IBM, BEA, Software AG) include UDDI capability in their products, with a
number of them using Systinet. Many vendors
also  include  metadata  management  capa¬ 
bilities  and  repository  components  (e.g., 
Fiorano,  Lombardi),  while  others  such  as 
Revelytix  specialize  around  semantic 
metadata.  The  DoD’s  metadata  discovery 
specification  is  not  directly  supported  by 
vendors,  though  those  that  support  the 
ebXML  architecture  can  act  as  enterprise 
catalog  instances. 

Business  Activity  Monitoring 

Business activity monitoring (BAM) allows management of an SOA in business
terms. Market research found that BAM is still in early development. Many
vendors from diverse industry segments provide BAM functionality.
Application integration and enterprise software vendors (BEA, Fiorano,
IBM, IONA Technologies, Microsoft, Oracle, SAP, TIBCO, webMethods, etc.)
are extending existing
assets  and  acquiring  additional  capabili¬ 
ties  in  order  to  support  BAM.  Business 
intelligence  vendors  (Business  Objects, 
Cognos,  Software  AG,  etc.)  are  working 
to  adapt  technology  and  incorporate 
business  rules  engines  into  their  solu¬ 
tions  to  support  real-time  BAM  opera¬ 
tions.  There  is  also  a  set  of  pure-play 
BAM  providers  who  focus  on  complete 
BAM  solutions.  Overall,  the  research 
found  little  standardization  across  ven¬ 
dor  implementations,  making  true  inter¬ 
operability  difficult  to  achieve  on  an 
enterprise level. The uses most commonly found in the research are
project-based and application-specific rather than general enterprise
infrastructure.

Enterprise  Services  Management 

Enterprise  services  management  (ESM) 
provides  for  managing  the  service  life 
cycle  and  is  the  foundation  for  SOA  run¬ 
time  governance.  Market  research  found 
a  limited  number  of  SOA  ESM  vendors. 
The  main  vendors  (e.g.,  IBM,  Hewlett- 
Packard)  possess  strong  portfolios  in 
traditional  network  management  and 
integrated  service  management  markets 
that  they  have  extended  to  ESM.  Most 
of  the  tools  researched  include  feature 
sets  spanning  the  range  from  low-level 
IT  service  management  to  the  higher- 
level  business  management  needs,  and 
the  differences  are  more  in  terms  of 
focus.  Often,  a  more  comprehensive 
solution  can  be  composed  by  combining 
products  (e.g.,  AmberPoint  and  HP 
OpenView  SOA  Manager). 

Business  Process  and  Workflow 
Automation  (Business  Process 
Modeling,  Execution,  and  Monitoring) 

The  BTI  must  provide  for  the  modeling 
and  execution  of  business  processes 
through  the  orchestration  of  services, 
and  the  monitoring  of  those  business 
processes.  While  the  research  found  that 
there  are  still  many  proprietary  modeling 
offerings,  there  is  considerable  conver¬ 
gence  around  Business  Process 
Modeling  Notation  (BPMN).  The 
research  also  found  strong  support 
across  vendors  for  the  Business  Process 


Execution  Language  standard,  though 
there  is  also  emerging  support  for  direct 
execution  of  BPMN  through  the  use  of 
the  XML  Process  Definition  Language, 
an  XML  serialization  of  BPMN.  Many 
vendors  also  provide  the  needed  moni¬ 
toring  of  those  processes  at  runtime, 
often  building  on  extensive  experience 
with  network  and  application  monitor¬ 
ing  capabilities.  Still  further  in  the  future 
are  tools  with  semantic  continuity  from 
modeling  to  execution  in  the  business 
process  arena;  however,  the  research  did 
find  that  what  already  exists  is  maturing 
rapidly,  and  can  provide  a  base  for 
implementing  the  BTI.  Perhaps  surpris¬ 
ingly,  not  a  single  vendor  included  the 
Unified  Modeling  Language  in  either  its 
list  of  product  offerings  relative  to  an 
SOA,  or  as  a  tool  that  it  uses  in  its  SOA 
engagements. 


"...  the  DoD  is 
making  IA  services 
a  part  of  the 
Net-Centric  Core 
Enterprise  Services  so 
that  security  is 
ubiquitous,  well-tested, 
and  a  part  of 
the infrastructure."


Data  Virtualization  and  Data 
Services 

Among  virtualization  trends,  virtualizing 
data  sources  has  emerged  as  a  real-world 
capability,  and  is  a  key  component  of  the 
BMA  SOA  vision  in  which  a  virtual  data 
store  makes  information  from  many 
sources  available  in  real  time  without  a 
physical store. Vendors in this space include
Composite Software, Red Hat MetaMatrix,
IBM, and StreamBase.

The BMA research found that, overall,
data virtualization and associated data
services have matured to the point that,
in many cases, they can provide
high-performance, robust data sources and
services for use in a net-centric
environment. This significantly reduces
the latency in data availability for
business analysts and decision makers,
who no longer need to wait for the
periodic load of a data warehouse or data mart.
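
To make the idea of a virtual data store concrete, the following Python sketch shows one way a single logical view could answer queries by reading its underlying sources at request time, with no physical copy. It is only an illustration: the class name VirtualView, the two in-memory lists standing in for systems of record, and the field names are all assumptions, and none of them describes a product from the vendors named above.

```python
from typing import Callable, Dict, Iterable, Iterator

Record = Dict[str, object]
Source = Callable[[], Iterable[Record]]


class VirtualView:
    """Presents several live sources as one logical data set, storing no copies."""

    def __init__(self, sources: Dict[str, Source]) -> None:
        self._sources = sources

    def query(self, predicate: Callable[[Record], bool]) -> Iterator[Record]:
        # Every source is read at query time, so results reflect current data.
        for name, fetch in self._sources.items():
            for record in fetch():
                if predicate(record):
                    # Tag each result with the (hypothetical) source it came from.
                    yield {**record, "source": name}


# Hypothetical in-memory stand-ins for two separate systems of record.
personnel = [{"id": 1, "unit": "A"}, {"id": 2, "unit": "B"}]
logistics = [{"id": 7, "unit": "A"}]

view = VirtualView({
    "personnel": lambda: personnel,
    "logistics": lambda: logistics,
})

unit_a = list(view.query(lambda r: r["unit"] == "A"))
```

Because nothing is cached in this sketch, a record added to either source is visible to the very next query, which is the latency benefit the research describes. A real product would add query planning, caching, and security on top of this basic shape.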


Information  Assurance  for 
SOA 

An  SOA  introduces  new  information 
assurance  (IA)  challenges.  The  interop¬ 
erability  and  extended,  net-centric  data 
sharing  capabilities  enabled  by  SOAs  are 
themselves  potential  points  of  vulnera¬ 
bility.  A  compromised  service  registry 
provides  an  attacker  with  a  detailed  map 
of  the  operations  and  capabilities  of  an 
organization.  Standards  and  standard 
protocols  narrow  the  range  of  network 
capabilities  that  an  attacker  must  sub¬ 
vert,  and  success  wins  wide  access. 
A responsible SOA deployment must
consider the effects of information
warfare in addition to other planning.
Only through such IA diligence
will  the  DoD  be  able  to  truly  realize  the 
savings  and  benefits  that  an  SOA 
promises  for  a  large,  geographically  dis¬ 
persed  organization  that  must  operate  in 
the  face  of  the  exigencies  of  war. 
Additionally, SOAs must meet long-standing
IA challenges, including reliability,
availability, and non-repudiation. An SOA
does  not  relieve  implementers  of  the 
responsibility  for  solid  engineering  in 
areas  of  platforms,  networking,  back¬ 
ups,  and  auditing.  Past  best  practices  and 
standards  must  be  brought  to  bear  on 
SOA  implementations  as  well  as  tradi¬ 
tional  ones. 

As  would  be  expected,  the  DoD  is 
making  IA  services  a  part  of  the  Net- 
Centric  Core  Enterprise  Services  so  that 
security  is  ubiquitous,  well-tested,  and  a 
part  of  the  infrastructure.  An  SOA  pro¬ 
vides  the  possibility  to  externalize  secu¬ 
rity  as  a  common,  cross-cutting  set  of 
capabilities,  themselves  presented  as  ser¬ 
vices.  In  this  way,  each  application  or 
program  does  not  have  to  master  the 
complex  technical  capabilities  required, 
but  can  declaratively  define  IA  require¬ 
ments  and  expect  them  to  be  honored 
and  enforced  by  the  infrastructure.  At 
the  same  time,  DoD-level  IA  policy  can 
be  enforced  on  SOA  operations,  includ¬ 
ing  authorization  control,  redaction,  and 
auditing.  An  SOA  must  also  work  with 
the  DoD’s  Public  Key  Infrastructure  to 
enable  secure  single  sign-on,  and  to 
ensure  preservation  of  appropriate  non¬ 
repudiation  characteristics  as  people  and 
systems  take  action  against  DoD  and 
BMA  data  assets. 

The  BTI  is  intended  to  embrace  and 
extend  the  DoD  SOA  and  IA  foundations. 
During the research, the BMA team studied
vendor capabilities with regard to IA
and security in a number of areas. In
particular, there was support for emerging
Web Services security standards and the
inclusion  of  IA  capabilities,  both  for 
enabling  IA  and  for  working  with  an 
enterprise’s  existing  IA  infrastructure. 

•  Support  for  Web  Services  IA 
Standards.  Vendor  support  for  the 
Web  Services  standards  stack  (WS-*) 
and  related  sets  of  XML  and  network 
IA standards (such as WS-Security, the
Security Assertion Markup Language,
and the eXtensible Access Control
Markup Language) is maturing rapidly
along with the standards themselves.
These standards are key to
moving  IA  into  the  infrastructure,  the 
SOA  foundation,  and  enabling  a 
declarative  IA.  Most  of  the  deep 
stacks  of  SOA  capability,  such  as 
those  from  IBM,  BEA,  Oracle,  and 
Microsoft,  have  incorporated  these 
standards  throughout. 

• Enabling IA Infrastructure Capabilities.
Some organizations included
in the research (such as
AmberPoint) focus explicitly on providing
SOA security capabilities. The
market  research  found  that  there  is  a 
trend  to  make  IA  an  integral  part  of 
SOA  through  provisioning,  gover¬ 
nance,  and  key  infrastructure,  such  as 
with  the  BTI’s  interoperability  con¬ 
troller.  This  holds  out  the  promise 
that  as  an  SOA  is  implemented  in  the 
BMA,  it  will  not  prove  to  be  the  soft 
and  chewy  inside  of  a  hard  and  crunchy 
perimeter  defense. 

•  Integration  With  Existing  IA 
Infrastructure.  DoD  IA  must  be  a 
consideration  from  the  beginning  of 
the  life  cycle.  An  SOA  must  be  able 
to  work  and  interoperate  with  IA 
standards,  practices,  and  approaches 
developed  during  the  DoD  and  U.S. 
intelligence  community’s  long  experi¬ 
ence  in  producing  networked  IT  sys¬ 
tems  to  provide  defense  in  depth. 
The  market  research  found  that  there 
is  a  convergence  in  this  arena,  with 
the  DoD  looking  to  adopt  industry 
and  commercial  best  practices  in  IA 
for  its  solutions,  and  SOA  vendors 
(included  in  the  research)  willing  to 
meet  and  accommodate  the  stringent 
IA requirements of the DoD.

Governance 

Governance — the  means  to  assure  that 
laws,  regulations,  and  policies  are  met  in 
IT  operations  and  investments — is  of 
key  importance  for  the  move  to  an  SOA. 
An SOA introduces new challenges for
IT and business governance, both because
solutions are composed from numerous
distributed services in an environment of
heterogeneous ownership and control, and
because it enables widespread sharing of
information and capabilities. The BMA
strategy for SOA governance addresses both
buildtime  and  runtime  needs. 

Buildtime  (Investment)  Governance 

The  research  assessed  buildtime  gover¬ 
nance  in  the  following  areas: 

•  Enterprise  Architecture  Satisfac¬ 
tion.  The  research  found  that  enter¬ 
prise  architecture  tools  are  moving  to 
explicitly  model  services,  such  as 
those  from  Mega  Software  or  IBM 
(Telelogic).  However,  these  tools 
have  (at  most)  limited  interoperabili¬ 
ty  with  tools  used  to  design  and 
develop  services.  These  tools  also 
provide  little  in  the  way  of  automat¬ 
ed  compliance  checking  or  manage¬ 
ment  of  the  transition  between 
enterprise  architecture  models  and 
service  designs  and  implementations. 

•  Duplication  Avoidance.  The  re¬ 
search  found  that  this  aspect  of  gover¬ 
nance  is  provided  largely  by  the  ability 
of  SOA  development  tools  and  envi¬ 
ronments  to  access  service  registries 
and  repositories.  This  allows  develop¬ 
ers  to  determine  whether  an  imple¬ 
mentation  for  their  service  already 
exists.  Additional  metadata  repository 
capabilities  (providing  further  infor¬ 
mation)  support  this  process. 

•  Service  Usage.  The  market  research 
found  that  the  main  mechanisms  for 
assuring  that  existing  services  are 
used  as  appropriate  are  through 
development  tools  that  integrate 
with  an  enterprise’s  service  registries 
and  repositories.  These  tools  provide 
developers  with  service  descriptions 
and  specifications  at  design  and  build 
time.  Many  tool  vendors,  such  as 
Lombardi  and  IBM,  provide  this 
capability. 


• Service Verification. The market
research found that there is good
support for testing and verifying SOA
services against functional requirements
and service level agreements
(SLAs) when combined with more
traditional automated testing tools.

•  SLA  Development.  The  market 
research  found  support  for  capturing 
SLAs,  but  support  for  the  actual  ini¬ 
tial  development  of  the  SLAs  is 
more  limited.  System  architects  and 
designers  need  to  pay  close  attention 
to  how  they  develop  SLAs  and  trans¬ 
late  them  into  digital  form  for  use  by 
automated  SLA  management  capa¬ 
bilities. 
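
As an illustration of what translating an SLA into digital form might involve, the Python sketch below captures two SLA terms as data and checks measured behavior against them. The type ServiceLevelAgreement, the function find_violations, the service name, and the thresholds are all hypothetical; the research does not prescribe a particular SLA format.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ServiceLevelAgreement:
    """Machine-readable SLA terms for one service (illustrative fields only)."""
    service: str
    max_latency_ms: float    # slowest acceptable response time
    min_availability: float  # required fraction of successful calls, 0 to 1


def find_violations(sla: ServiceLevelAgreement,
                    latencies_ms: Sequence[float],
                    successes: int,
                    attempts: int) -> List[str]:
    """Return the names of the SLA terms that the measurements break."""
    problems: List[str] = []
    if latencies_ms and max(latencies_ms) > sla.max_latency_ms:
        problems.append("latency")
    if attempts > 0 and successes / attempts < sla.min_availability:
        problems.append("availability")
    return problems


# Hypothetical agreement and measurements for a single reporting period.
sla = ServiceLevelAgreement("order-status", max_latency_ms=500.0,
                            min_availability=0.99)
report = find_violations(sla, [120.0, 640.0, 90.0], successes=995, attempts=1000)
```

Here one response took 640 ms against a 500 ms limit, so the report flags latency, while 995 successes out of 1,000 attempts satisfies the availability term. The hard part the research points to is agreeing on the terms themselves; once they are in a form like this, automated management is comparatively simple.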

Runtime  (Operations)  Governance 

Runtime  governance  should  provide  vis¬ 
ibility  into  service  operation  allowing 
management  of  services,  the  ability  to 
take  corrective  action  (as  needed)  to 
ensure  effectively  uninterrupted  business 
operations,  and  the  capture  of  operation 
audit information. Provisioning, deploying
new services, and taking old services
out of operation without significant
impact on business activities or overall
operations are key parts of overall runtime
governance. Characteristics looked
for  in  runtime  governance  include  the 
following: 

•  Operational  Visibility.  Make  the 
runtime  state  visible  in  both  techni¬ 
cal  (network  and  machine  usage)  and 
business  terms. 

•  Service  Management.  Monitor 
and  manage  the  execution  and  oper¬ 
ation  of  services  in  an  SOA. 

•  Policy  Enforcement.  Enforce  secu¬ 
rity  and  other  policy-based  con¬ 
straints  in  a  declarative  fashion, 
external  to  SOA  services,  allowing 
systems  to  adapt  quickly  to  changing 
policy  circumstances  without  coding. 

•  Auditing.  Track  and  record  key 
events  and  actions  within  the  SOA 
environment  for  later  analysis. 

• Provisioning and Configuration
Management. Provision services
for deployment in the SOA and track
their configuration across changes as
they occur.
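
The policy enforcement and auditing characteristics above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any surveyed tool: the names Policy and PolicyEnforcementPoint, the role names, and the get_invoice service are all hypothetical. The point is that the rule lives in data outside the service, so changing policy does not require changing service code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Policy:
    """A declarative rule: which roles may invoke which operation."""
    operation: str
    allowed_roles: FrozenSet[str]


class PolicyEnforcementPoint:
    """Checks policy and records an audit entry before calling a service."""

    def __init__(self, policies: Iterable[Policy]) -> None:
        self._policies: Dict[str, Policy] = {p.operation: p for p in policies}
        self.audit_log: List[Tuple[str, str, bool]] = []

    def invoke(self, operation: str, role: str,
               service: Callable[[], str]) -> str:
        policy = self._policies.get(operation)
        # Default deny: an operation with no policy is refused.
        permitted = policy is not None and role in policy.allowed_roles
        # Every attempt is recorded, whether or not it is allowed.
        self.audit_log.append((operation, role, permitted))
        if not permitted:
            raise PermissionError(f"{role} may not call {operation}")
        return service()


def get_invoice() -> str:
    # The service itself contains no security logic.
    return "invoice-123"


pep = PolicyEnforcementPoint([Policy("get_invoice", frozenset({"auditor"}))])
result = pep.invoke("get_invoice", "auditor", get_invoice)
```

Granting access to another role means supplying a different Policy value when the enforcement point is constructed; get_invoice is untouched. Denied calls raise PermissionError and still appear in audit_log for later analysis.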

Governance  Conclusions 

The  market  research  found  no  complete 
solution  available  as  a  single  package, 
but  there  is  considerable  governance 
capability  available  in  the  marketplace. 
For  example,  in  the  area  of  provisioning 
and  configuration  management,  the 
research  found  that  SOA  management 
tools  provide  some  of  this  capability, 
but may need to be joined with more traditional
configuration management and
deployment tools for reasonable capability.
Governance capability (as required
by  the  BMA  strategy  for  SOA)  can  be 
provided  through  commercial  tools,  but 
designers  must  carefully  assess  and 
acquire  the  components  from  various 
vendors  in  accordance  with  a  strong 
design  and  plan  that  they  must  create  for 
themselves. 

Conclusion 

The  DoD  BMA  has  embarked  on  an  SOA 
strategy.  The  “BMA  Architecture  Feder¬ 
ation  Strategy  and  Roadmap”  provides 
guidance  for  the  DoD  BMA  to  quickly 
gain  business  value  by  delivering  capabili¬ 
ty  to  support  the  warfighter  through  an 
SOA,  while  using  a  phased  approach  for 
transforming  legacy  systems.  The  mar¬ 
ket  research  performed  by  the  BMA 
Office  of  the  CTO  and  CA  has  found 
that  industry  capabilities  to  implement 
or  enable  the  components  defined  in  the 
BMA  Service-Oriented  Infrastructure 
have  matured  in  the  marketplace.  While 
serious  caution  remains  in  the  areas  of 
IA  and  security,  and  the  need  for  signif¬ 
icant  cultural  change  for  successful  SOA 
implementation  cannot  be  overempha¬ 
sized,  it  is  clear  that  it  is  feasible  for  an 
enterprise  the  size  of  the  DoD  to  move 
forward  on  implementing  an  SOA  and 
to  realize  the  business  benefits  of  agility, 
interoperability,  and  net-centric  data 
sharing  that  an  SOA  provides. 

The  opinions  expressed  in  this  article 
are  those  of  the  authors  only  and  in  no 
way constitute the policy or express
direction  of  the  DoD.  For  additional 
information  about  the  vendors,  see  the 
online  version  of  this  article. ♦ 

Note 

1.  According  to  the  U.S.  Government 
Accountability  Office,  “A  service- 
oriented  architecture  is  an  approach 
for  sharing  functions  and  applica¬ 
tions  across  an  organization  by 
designing  them  as  discrete,  reusable, 
business-oriented  services.  These 
services  need  to  be,  among  other 
things,  (1)  self-contained  ...;  (2)  pub¬ 
lished  and  exposed  as  self-describing 
...;  and  (3)  subscribed  to  via  well- 
defined  and  standardized  interfaces.” 
Fault Handling and Fault Tolerance

www.eventhelix.com/RealtimeMantra/FaultHandling 
EventHelix.com  Inc.  has  several  articles  covering  software  and 
hardware  fault  handling  and  fault  tolerance  techniques.  Along 
with  articles  on  the  basics  of  hardware  and  software  as  well  as  a 
description  of  the  fault  handling  life  cycle,  you’ll  find  tech¬ 
niques  for  making  systems  more  tolerant  to  software  bugs, 
reboot  and  recovery,  hardware  redundancy,  measuring  and 
improving  computer  system  reliability,  and  calculating  system 
availability. 

Fraunhofer  Center  -  Maryland  (FC-MD) 

http://fc-md.umd.edu/fcmd 

The  FC-MD  is  a  non-profit  applied  research  and  technology 
transfer  organization  with  the  mission  to  advance  the  state-of- 
the-practice  in  software  development  and  acquisition  organiza¬ 
tions  by  applying  state-of-the-art  research  results.  Along  with 
details  from  their  projects,  publications,  and  training  courses, 
you  can  learn  about  Experience  Factory,  their  unique  model  for 
better  understanding  and  managing  your  software  business. 

DO-178 Industry Group Homepage

www.do178site.com

DO-178B  is  considered  the  world’s  strictest  software  standard 
and  influences  other  domains  including  medical  devices,  trans¬ 
portation,  and  telecommunications.  The  DO-178B  Industry 


Group  is  the  world’s  largest  collection  of  avionics  companies 
and  DO-178B  avionics  product  and  services  providers  who 
share  a  common  mission  of  achieving  DO-178B  success. 

The  Common  Criteria  (CC)  Portal 

www.commoncriteriaportal.org 

Learn  more  about  the  CC  at  this  internationally  recognized 
Web  portal,  the  driving  force  for  the  widest  available  mutual 
recognition  of  secure  IT  products.  This  site  provides  informa¬ 
tion  on  the  status  of  the  CC  Recognition  Agreement,  CC  and 
certification  schemes,  licensed  laboratories,  certified  products, 
as  well  as  related  information,  news,  and  events. 

Vision  for  a  Net-Centric,  Service-Oriented 
DoD  Enterprise 

www.defenselink.mil/cio-nii/docs/GIGArchVision.pdf
The  Department  of  Defense  (DoD)  is  transforming  to  become 
a  service-oriented,  net-centric  force.  Last  year,  the  DoD’s  Global 
Information  Grid  (GIG)  put  out  a  detailed  architectural  vision 
with  the  goal  of  promoting  a  unity  of  effort  in  reaching  their 
target  state  for  the  development  of  GIG  capabilities  that  will 
support  future  DoD  missions,  operations,  and  functions.  They 
also  provide  a  short,  high-level,  understandable  description  of 
the  DoD’s  objective  enterprise  architecture. 
www.sstc-online.org


"Technology:  Advancing  Precision" 

20-23 April 2009 - Salt Lake City, Utah


Save  the  Date  Now,  Plan  to  Attend! 

WHO  SHOULD  ATTEND: 

•  Acquisition  Professionals  •  Systems  Engineers 

•  Program/Project  Managers  •  Process  Engineers 

•  Programmers  •  Quality  and  Test  Engineers 

•  Systems  Developers 

TOPICS  FOR  SSTC  2009 

•  Assurance  and  Security 

•  Robust,  Reliable,  and  Resilient  Engineering 

•  Policies  and  Standards 

•  Processes  and  Methods 

•  New  Concepts  and  Trends 

•  Modernizing  Systems  and  Software 

•  Developmental  Life  Cycle 

•  Estimating  and  Measuring 

•  Professional  Development  /  Education 

•  Lessons  Learned 

•  Competitive  Modeling 
BackTalk

Whose FAULT Is It, Anyway?

Way back in the 1980s, I was introduced to Moore’s Law¹ in
an  engineering  class.  It  says  that  computing  power  doubles 
approximately  every  18  months  to  two  years.  This  same  law 
appears to apply to many computer-related items such as processing
speed, memory capacity, and even digital camera resolution.
There are a lot of similar laws for disk size, power consumption,
network capacity, etc. Moore’s Law has held true since its
origination  in  1965,  and  will  probably  hold  true  for  at  least  the 
next  decade.  Every  few  years  engineers  will  say  that  we’ve 
reached  the  limit  of  Moore’s  Law,  but  new  technology  keeps 
proving  it  true.  The  bottom  line  is  that  technology  grows  at  an 
exponential  rate. 

My  first  computer,  a  Commodore  SuperPet  that  I  bought  back 
in  1982,  had  a  whopping  128  kilobytes  (KB)  of  memory.  I  don’t 
recall  the  clock  speed,  but  it  was  a  relatively  slow  6502  processor 
that  I  believe  was  at  about  1  megahertz.  (As  an  historical  point,  the 
Commodore  SuperPet  also  had  a  6809  processor,  and  you  could 
run  dual  operating  systems  and  interpreters  for  Pascal,  APL,  FOR¬ 
TRAN,  COBOL,  plus  the  obligatory  BASIC  interpreters). 

Twenty-six years later, my current laptop has three gigabytes
(GB) of memory. This pretty much follows Moore’s Law, as
does the processing speed of my current machine, a two gigahertz
(dual processor)².
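
As a rough check of that claim, the short Python sketch below works out the doubling period implied by going from 128 KB to 3 GB in 26 years. The function name doubling_period_years is made up for this illustration, and binary units (1 KB = 1,024 bytes) are an assumption.

```python
import math


def doubling_period_years(start_bytes: float, end_bytes: float,
                          years: float) -> float:
    """Years per doubling implied by growth from start_bytes to end_bytes."""
    doublings = math.log2(end_bytes / start_bytes)
    return years / doublings


KB = 1024       # assumed binary units
GB = 1024 ** 3

# 128 KB in 1982 to 3 GB twenty-six years later, as described in the column.
period = doubling_period_years(128 * KB, 3 * GB, 26)
print(f"{period:.2f} years per doubling")  # prints "1.78 years per doubling"
```

The memory grew by a factor of 24,576, or about 14.6 doublings, which works out to roughly 21 months per doubling. That sits inside the 18-months-to-two-years range quoted for Moore’s Law.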

It  appears  that  Moore’s  Law  also  applies  to  the  size  of  the 
operating  system.  Remember  MS-DOS  2.11?  Back  in  1983,  it 
loaded  in  64KB — and  left  you  room  to  run  your  programs! 
Windows  95  (12  years  later)  took  50  megabytes  (MB)  of  disk 
space.  And  now  with  Windows  Vista,  Microsoft  says  you  need 
15GB  of  free  disk  space  and  512MB  of  memory  (still  following 
Moore’s  Law). 

What’s  the  point,  you  ask?  I’m  not  Microsoft  bashing.  I  like  my 
operating  system.  I  find  it  useful  to  run  multiple  applications, 
tons  of  sidebar  gadgets,  high-resolution  graphics,  and  have  music 
playing  in  the  background. 

However,  with  the  increased  program  size,  did  you  know  that 
the  chance  for  failure  goes  up,  too?  I  know  that  Vista  is  a  pretty 
solid  operating  system.  I  haven’t  had  a  single  blue  screen  of 
death,  and  only  about  three  needed  hard  shutdown  and  reboot 
occasions  since  I  bought  it  last  year  (as  opposed  to  about  three- 
a-day  from  my  first  experiences  with  Windows  95,  if  I  recall).  I’m 
talking  about  the  chance  of  failure  that  comes  from  large-scale 
reliance  on  the  compliance  of  others.  And  reliance  on  the  rapid¬ 
ly  expanding  technology  raises  the  potential  for  problems. 

In our office, we have a one terabyte (TB) four-disk RAID
(redundant array of inexpensive disks, as named by the inventor, or
occasionally known as redundant array of independent disks)³ cluster.
It has the very reasonable name of “Terrabyte⁴.” If any two of the
four disks fail, we still have a complete set of data. Until an
“unexpected”  failure  took  it  totally  down.  Our  domain  name  sys¬ 
tem  server  died — and  with  it  down,  Terrabyte  was  unable  to  grab 
an  address.  It  took  us  a  bit  of  time  to  locate  the  problem,  and  fig¬ 
ure  out  how  to  reconfigure  it.  Of  course,  eventually,  we  realized 
that  we  could  just  plug  directly  into  a  computer.  Then  we  realized 
that  the  permissions  on  the  file  access  were  based  on  domain 
authentication;  so  even  though  I  could  plug  the  device  directly  into 
my computer, it couldn’t authenticate the access. Sure, it was fixable,
but the delay cost several of us a bit of work. And, I admit,
there  was  a  bit  of  momentary  panic  when  somebody  asked,  “Just 
in  case,  we  do  have  a  backup  of  it,  don’t  we?” 

We all have become dependent upon the increasing complexity
of new technology. And when the technology fails, we all feel
powerless.  It’s  not  like  any  of  us  can  keep  four  or  five  different 
backups  around  on  floppy  anymore — backing  up  a  TB  RAID 
cluster  takes  some  serious  storage! 

The  point  is  that  increased  power,  increased  memory,  and 
increased  disk  storage  bring  increased  PPoF  (Potential  Points  of 
Failure)⁵. And you need to plan for these failures.

Are  you  developing  large-scale  applications?  Have  you  con¬ 
sidered  what  to  do  in  case  the  network  fails?  The  database  fails? 
How  many  backups  do  you  have?  Where  are  the  backups  located 
— having  them  in  the  same  location  really  won’t  help  in  case  of 
fire  or  flood,  will  it?  Whatever  technology  you  implement,  even¬ 
tually  one  of  your  users  will  run  into  a  case  where  something 
goes  bad,  and  they  are  going  to  expect  you  to  have  thought  of 
the  potential  problem,  and  developed  a  contingency  plan  for  it! 

Technology  lures  you  in — like  when  you’re  stuck  in  the  airport, 
flight  cancelled,  you  need  to  re-book,  and  you  realize  your  cell 
phone  is  out  of  juice.  Backup?  Tried  to  find  a  pay  phone  lately? 
Kind  of  makes  you  long  for  the  days  when  a  spare  deck  of  cards 
in  your  desk  took  care  of  your  backup  needs. 

Speaking  of  faults,  this  column  was  almost  late  because  the  e- 
mail  from  the  CROSSTALK  editors  reminding  me  my  article  was 
due  was  somehow  misdirected  into  my  junk  mail  folder.  I  hesi¬ 
tate  to  state  how  great  my  life  would  be  if  the  other  99  percent 
of  my  daily  e-mail  was  similarly  (but  faultily)  misdirected.  If  only 
Outlook  had  an  “I.Q.  filter,”  similar  to  caller  ID.  Then,  when 
folks  complained  that  I  never  responded  to  their  e-mail,  I  could 
say  “Honest,  it’s  not  my  fault!” 

— David  A.  Cook,  Ph.D. 

The  AEgis  Technologies  Group,  Inc. 

dcook@aegistg.com 

Notes 

1. Wikipedia. <http://en.wikipedia.org/wiki/Moore%27s_law>.
And  before  anybody  corrects  me,  yes,  I  know  that  Moore’s 
Law  originally  referred  to  the  number  of  transistors  on  a  chip. 

2. See <http://nano-taiwan.sinica.edu.tw/2008_WinterSchool/index/Moore%27slaw%20graph2.gif>
for an image of the growth in Intel Processors.

3.  Wikipedia  again. 

4.  Why  the  extra  “r”?  Because  my  granddaughter  is  named 
Terra,  I  love  her,  and  my  office  was  foolish  enough  to  let  me 
name  our  RAID  cluster. 

5.  Yes,  I  made  this  one  up! 

Can  You  BACKTALK? 

Here  is  your  chance  to  make  your  point  without  your  boss 
censoring  your  writing.  In  addition  to  accepting  articles  that 
relate  to  software  engineering  for  publication  in  CROSSTALK, 
we  also  accept  articles  for  the  BackTalk  column.  These  arti¬ 
cles  should  provide  a  concise,  clever,  humorous,  and  insight¬ 
ful  perspective  on  the  software  engineering  profession  or 
industry  or  a  portion  of  it.  Your  BackTalk  article  should  be 
entertaining  and  clever  or  original  in  concept,  design,  or  deliv¬ 
ery,  and  should  not  exceed  750  words. 

For  more  information  on  how  to  submit  your  BackTalk 
article,  go  to  <www.stsc.hill.af.mil>. 
Current Readiness

Contribute  to  delivering  Naval  aviation  units  ready  for  tasking 
with  the  right  capability,  at  the  right  time,  and  the  right  cost. 

Future  Capability 

Deliver  new  aircraft,  weapons,  and  systems  on  time  and 
within  budget,  that  meet  Fleet  needs  and  provide  a 
technological  edge  over  our  adversaries. 

People 

Develop  our  people  and  provide  them  with  the  tools, 
infrastructure,  and  processes  they  need  to  do  their  work. 